Nonextensive  Entropic  Kernels 

Andre  F.  T.  Martins,  Mario  A.  T.  Figueiredo 
Pedro  M.  Q.  Aguiar,  Noah  Smith,  Eric  P.  Xing 

August  2008 
CMU-ML-08-106 



Nonextensive  Entropic  Kernels 

André F. T. Martins*†, Mário A. T. Figueiredo†,
Pedro M. Q. Aguiar‡, Noah A. Smith*,
Eric P. Xing*

August 2008
CMU-ML-08-106


School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213


*School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA,
†Instituto de Telecomunicações / ‡Instituto de Sistemas e Robótica, Instituto Superior Técnico,
Lisboa, Portugal


This work was partially supported by Fundação para a Ciência e Tecnologia (FCT), Portugal, grant
PTDC/EEA-TEL/72572/2006 and by the European Commission under project SIMBAD. A.M. was supported
by a grant from FCT through the CMU-Portugal Program and the Information and Communications
Technologies Institute (ICTI) at CMU. N.S. was supported by NSF IIS-0713265 and DARPA
HR00110110013. E.X. was supported by NSF DBI-0546594, DBI-0640543, and IIS-0713379.


Keywords: Positive definite kernels, nonextensive information theory, Tsallis entropy,
Jensen-Shannon divergence, string kernels.


Abstract 

Positive definite kernels on probability measures have been recently applied in classification
problems involving text, images, and other types of structured data. Some of these kernels are
related to classic information theoretic quantities, such as (Shannon's) mutual information and the
Jensen-Shannon (JS) divergence. Meanwhile, there have been recent advances in nonextensive
generalizations of Shannon's information theory. This paper bridges these two trends by introducing
nonextensive information theoretic kernels on probability measures, based on new JS-type
divergences. These new divergences result from extending the two building blocks of the classical
JS divergence: convexity and Shannon's entropy. The classical notion of convexity is extended
to the wider concept of q-convexity, for which we prove a Jensen q-inequality. Based on this
inequality, we introduce Jensen-Tsallis (JT) q-differences, a nonextensive generalization of the JS
divergence, and define a k-th order JT q-difference between stochastic processes. We then define
a new family of nonextensive mutual information kernels, which allow weights to be assigned
to their arguments, and which includes the Boolean, JS, and linear kernels as particular cases.
Nonextensive string kernels are also defined that subsume the p-spectrum kernel. We illustrate the
performance of these kernels on text categorization tasks, in which documents are modeled both
as bags-of-words and as sequences of characters.


1  Introduction 


In kernel-based machine learning [Schölkopf and Smola, 2002, Shawe-Taylor and Cristianini,
2004], there has been recent interest in defining kernels on probability distributions, to tackle
several problems involving structured data [Desobry et al., 2007, Moreno et al., 2004, Jebara et al.,
2004, Hein and Bousquet, 2005, Lafferty and Lebanon, 2005, Cuturi et al., 2005]. By defining a
parametric family S containing the distributions from which the data points (in the input space X)
are assumed to have been generated, and defining a map from X to S (e.g., through maximum
likelihood estimation), a distribution in S may be fitted to each datum. Therefore, a kernel that is
defined on S × S automatically induces a kernel on the original input space, through map
composition. In text categorization, this framework appears as an alternative to the Euclidean geometry
inherent to the usual bag-of-words vector representations. In fact, approaches that map data to
statistical manifolds, equipped with well-motivated non-Euclidean metrics [Lafferty and Lebanon,
2005], often outperform support vector machine (SVM) classifiers with linear kernels [Joachims,
2002]. Some of these kernels have a natural information theoretic interpretation, establishing a
bridge between kernel methods and information theory [Cuturi et al., 2005, Hein and Bousquet,
2005].

The main goal of this paper is to widen that bridge; we do that by introducing a new wide
class of kernels rooted in nonextensive information theory, which contains previous information
theoretic kernels as particular elements. The Shannon and Rényi entropies [Shannon, 1948, Rényi,
1961] share the extensivity property: the joint entropy of a pair of independent random variables
equals the sum of the individual entropies. Abandoning this property yields the so-called
nonextensive entropies [Havrda and Charvát, 1967, Lindhard, 1974, Lindhard and Nielsen, 1971, Tsallis,
1988], which have raised great interest among physicists in modeling certain phenomena (e.g.,
long-range interactions and multifractals) and in the construction of a nonextensive generalization
of the classical Boltzmann-Gibbs statistical mechanics [Abe, 2006]. Nonextensive entropies have
also been recently used in signal/image processing [Li et al., 2006] and many other areas
[Gell-Mann and Tsallis, 2004]. The so-called Tsallis entropies [Havrda and Charvát, 1967, Tsallis, 1988]
form a parametric family of nonextensive entropies that includes the Shannon-Boltzmann-Gibbs
entropy as a particular case. Some attempts have been made to construct a nonextensive
generalization of information theory [Furuichi, 2006].

Convexity is a key concept underlying several fundamental results in information theory, e.g.,
the non-negativity of the Kullback-Leibler (KL) divergence (also called relative entropy), namely
via the many implications of Jensen's inequality [Cover and Thomas, 1991, Jensen, 1906]. Jensen's
inequality also underlies the concept of Jensen-Shannon (JS) divergence, which is a symmetrized
and smoothed version of the KL divergence [Lin and Wong, 1990, Lin, 1991]. The JS divergence is
widely used in areas such as statistics, machine learning, image and signal processing, and physics.

In this paper, we introduce new extensions of JS-type divergences by generalizing their two
pillars: convexity and Shannon's entropy. These divergences are then used to define new
information-theoretic kernels between probability distributions. More specifically, our main contributions are:

• The concept of q-convexity, as a generalization of convexity, for which we prove a Jensen
q-inequality. The related concept of Jensen q-differences, which generalize Jensen differences,
is also proposed. Based on these concepts, we introduce the Jensen-Tsallis q-difference, a
nonextensive generalization of the JS divergence, which is also a "mutual information" in
the sense of Furuichi [2006].

• Characterization of the Jensen-Tsallis q-difference, with respect to convexity and extrema,
extending the work by Burbea and Rao [1982] and by Lin [1991] for the JS divergence.

• Definition of k-th order joint and conditional Jensen-Tsallis q-differences for families of
stochastic processes, and derivation of a chain rule.

• We propose a broad family of (nonextensive information theoretic) positive definite kernels,
which are interpretable as nonextensive mutual information kernels. This family ranges
from the Boolean to the linear kernels, and also includes the JS kernel proposed by Hein and
Bousquet [2005].

• We define a family of (nonextensive information theoretic) positive definite kernels between
stochastic processes, which subsume well-known string kernels like the p-spectrum kernel
[Leslie et al., 2002].

• We extend results of Hein and Bousquet [2005] by proving positive definiteness of kernels
based on the unbalanced JS divergence. A connection between these new kernels and those
previously studied by Fuglede [2005] and by Hein and Bousquet [2005] is also established.
As a side note, we show that the parametrix approximation of the multinomial diffusion
kernel introduced by Lafferty and Lebanon [2005] is not positive definite in general.

The rest of the paper is organized as follows. Section 2 reviews the concepts of nonextensive
entropies, with emphasis on the Tsallis case. Section 3 introduces denormalization formulae for
several entropies and divergences, to be used in later sections. Section 4 discusses Jensen
differences and divergences. The concepts of q-differences and q-convexity are introduced in Section 5,
where they are used to define and characterize some new divergence-type quantities. In Section 6,
we define the Jensen-Tsallis q-difference and derive some of its properties; in that section, we also
define k-th order Jensen-Tsallis q-differences for families of stochastic processes. The new family
of entropic kernels is introduced and characterized in Section 7, after a brief review of some key
results concerning positive definite kernels; that section also presents a brief review of string kernels,
and introduces nonextensive kernels between stochastic processes. Section 7 ends by proving that
the parametrix approximation of the multinomial diffusion kernel is not positive definite. Section 8
reports experiments on text categorization using both a bag-of-words and a sequential
representation of documents. Finally, Section 9 contains concluding remarks and discusses directions for
future research.

Earlier and shorter versions of this work have appeared in Martins et al. [2008a] and Martins
et al. [2008b].


2  Nonextensive entropies and Tsallis statistics


We start with a brief overview of nonextensive entropies. In what follows, $\mathbb{R}_+$ denotes the
nonnegative reals, $\mathbb{R}_{++}$ denotes the strictly positive reals, and

$$\Delta^{n-1} \triangleq \Big\{ (x_1, \dots, x_n) \in \mathbb{R}^n \;\Big|\; \sum_{i=1}^n x_i = 1,\ \forall i\ x_i \ge 0 \Big\} \tag{1}$$

denotes the (n − 1)-dimensional simplex.

Inspired by the Shannon-Khinchin axiomatic formulation of Shannon's entropy [Khinchin,
1957, Shannon and Weaver, 1949], Suyari [2004] proposed an axiomatic framework for
nonextensive entropies and a uniqueness theorem. Let $q \ge 0$ be a fixed scalar, called the entropic index,
and let $f_q$ be a function defined on $\Delta^{n-1}$. Consider the following set of axioms:

(A1) Continuity: $f_q$ is continuous in $\Delta^{n-1}$ and $q \ge 0$;

(A2) Maximality: For any $q \ge 0$, $n \in \mathbb{N}$, and $(p_1, \dots, p_n) \in \Delta^{n-1}$,

$$f_q(p_1, \dots, p_n) \le f_q(1/n, \dots, 1/n);$$

(A3) Generalized additivity: For $i = 1, \dots, n$, $j = 1, \dots, m_i$, $p_{ij} \ge 0$, and $p_i = \sum_{j=1}^{m_i} p_{ij}$,

$$f_q(p_{11}, \dots, p_{n m_n}) = f_q(p_1, \dots, p_n) + \sum_{i=1}^{n} p_i^q \, f_q\!\left( \frac{p_{i1}}{p_i}, \dots, \frac{p_{i m_i}}{p_i} \right);$$

(A4) Expandability: $f_q(p_1, \dots, p_n, 0) = f_q(p_1, \dots, p_n)$.

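The Suyari axioms (A1)-(A4) uniquely determine a function $S_{q,\phi} : \Delta^{n-1} \to \mathbb{R}$ of the form

$$S_{q,\phi}(p_1, \dots, p_n) = \begin{cases} \dfrac{k}{\phi(q)} \Big( 1 - \sum_{i=1}^n p_i^q \Big) & \text{if } q \ne 1, \\[2mm] -k \sum_{i=1}^n p_i \ln p_i & \text{if } q = 1, \end{cases} \tag{2}$$

where $k$ is a positive constant, and $\phi : \mathbb{R}_+ \to \mathbb{R}$ is a continuous function that satisfies the following
three conditions:

(i) $\phi(q)$ has the same sign as $q - 1$;

(ii) $\phi(q)$ vanishes if and only if $q = 1$;

(iii) $\phi$ is differentiable in a neighborhood of 1 and $\phi'(1) = 1$.

Note that $S_{1,\phi} = \lim_{q \to 1} S_{q,\phi}$, thus $S_{q,\phi}(p_1, \dots, p_n)$, seen as a function of $q$, is continuous at $q = 1$.
For any $\phi$ satisfying these conditions, $S_{q,\phi}$ has the pseudoadditivity property: for any two
independent random variables $A$ and $B$, with probability mass functions $p_A \in \Delta^{n_A - 1}$ and
$p_B \in \Delta^{n_B - 1}$ respectively, consider the new random variable $A \otimes B$ defined by the joint distribution
$p_A \otimes p_B \in \Delta^{n_A n_B - 1}$; then,

$$S_{q,\phi}(A \otimes B) = S_{q,\phi}(A) + S_{q,\phi}(B) - \frac{\phi(q)}{k} S_{q,\phi}(A) S_{q,\phi}(B),$$

where we denote (as usual) $S_{q,\phi}(A) \triangleq S_{q,\phi}(p_A)$.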

For $q = 1$, Suyari's axioms recover the Shannon-Boltzmann-Gibbs (SBG) entropy,

$$S_{1,\phi}(p_1, \dots, p_n) = H(p_1, \dots, p_n) = -k \sum_{i=1}^{n} p_i \ln p_i, \tag{3}$$

and pseudoadditivity turns into additivity, i.e., $H(A \otimes B) = H(A) + H(B)$ holds.

Several proposals for $\phi$ have appeared in the literature [Havrda and Charvát, 1967, Daróczy,
1970, Tsallis, 1988]. In the sequel, unless stated otherwise, we set $\phi(q) = q - 1$, which yields the
Tsallis entropy:

$$S_q(p_1, \dots, p_n) = \frac{k}{q - 1} \Big( 1 - \sum_{i=1}^{n} p_i^q \Big). \tag{4}$$

To simplify, we let $k = 1$ and write the Tsallis entropy as

$$S_q(X) \triangleq S_q(p_1, \dots, p_n) = -\sum_{x \in \mathcal{X}} p(x)^q \ln_q p(x), \tag{5}$$

where $\ln_q(x) \triangleq (x^{1-q} - 1)/(1 - q)$ is the q-logarithm function, which satisfies
$\ln_q(xy) = \ln_q(x) + x^{1-q} \ln_q(y)$ and $\ln_q(1/x) = -x^{q-1} \ln_q(x)$. This notation was introduced by Tsallis [1988].
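As a quick numerical sanity check of (4), (5), and the pseudoadditivity property, the following sketch evaluates the Tsallis entropy of small discrete distributions with k = 1. The function names and the example distributions are illustrative choices, not part of the original report.

```python
import math

def ln_q(x, q):
    """q-logarithm ln_q(x) = (x**(1-q) - 1) / (1-q); natural log when q == 1."""
    if q == 1.0:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def tsallis_entropy(p, q):
    """Tsallis entropy in the form of (5), with k = 1; zero-probability terms are skipped."""
    return -sum(pi ** q * ln_q(pi, q) for pi in p if pi > 0.0)

def tsallis_entropy_closed(p, q):
    """Equivalent closed form (4) with k = 1, valid for q != 1."""
    return (1.0 - sum(pi ** q for pi in p if pi > 0.0)) / (q - 1.0)

def product(pa, pb):
    """Joint distribution of two independent variables, flattened to a list."""
    return [a * b for a in pa for b in pb]

# Illustrative distributions and entropic index (assumptions, not from the paper).
p_a = [0.5, 0.3, 0.2]
p_b = [0.6, 0.4]
q = 1.7

shannon = -sum(pi * math.log(pi) for pi in p_a)
s_a = tsallis_entropy(p_a, q)
s_b = tsallis_entropy(p_b, q)
s_ab = tsallis_entropy(product(p_a, p_b), q)

# Residuals of the three identities; each should be at rounding-error level.
print(abs(s_a - tsallis_entropy_closed(p_a, q)))         # forms (4) and (5) agree
print(abs(tsallis_entropy(p_a, 1.0) - shannon))          # q = 1 recovers (3)
print(abs(s_ab - (s_a + s_b - (q - 1.0) * s_a * s_b)))   # pseudoadditivity with k = 1
```

Skipping zero-probability terms corresponds to the usual convention that they contribute nothing to the entropy for q > 0.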

Furuichi [2006] derived some information theoretic properties of Tsallis entropies. Tsallis joint
and conditional entropies are defined, respectively, as

$$S_q(X, Y) \triangleq -\sum_{x,y} p(x, y)^q \ln_q p(x, y) \tag{6}$$

and

$$S_q(X|Y) \triangleq -\sum_{x,y} p(x, y)^q \ln_q p(x|y) = \sum_{y} p(y)^q S_q(X|y), \tag{7}$$

and the chain rule $S_q(X, Y) = S_q(X) + S_q(Y|X)$ holds.
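The chain rule can be checked numerically on a small joint distribution. The sketch below is only an illustration: the joint table, the value of q, and the helper names are assumptions made for this example.

```python
import math

def ln_q(x, q):
    """q-logarithm; reduces to the natural logarithm at q == 1."""
    if q == 1.0:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def s_marginal(p, q):
    """Tsallis entropy (5) of a single pmf, with k = 1."""
    return -sum(pi ** q * ln_q(pi, q) for pi in p if pi > 0.0)

def s_joint(joint, q):
    """Tsallis joint entropy (6); joint[x][y] holds p(x, y)."""
    return -sum(pxy ** q * ln_q(pxy, q) for row in joint for pxy in row if pxy > 0.0)

def s_y_given_x(joint, q):
    """Conditional entropy S_q(Y|X): first form of (7) with the roles of X and Y swapped."""
    total = 0.0
    for row in joint:
        px = sum(row)
        for pxy in row:
            if pxy > 0.0:
                total -= pxy ** q * ln_q(pxy / px, q)
    return total

def s_y_given_x_weighted(joint, q):
    """Same quantity in the second form of (7): sum_x p(x)^q S_q(Y | X = x)."""
    total = 0.0
    for row in joint:
        px = sum(row)
        if px > 0.0:
            total += px ** q * s_marginal([pxy / px for pxy in row], q)
    return total

# Illustrative joint pmf on a 2 x 3 grid (rows index x, columns index y).
joint = [[0.10, 0.25, 0.15],
         [0.20, 0.05, 0.25]]
q = 0.6
p_x = [sum(row) for row in joint]

lhs = s_joint(joint, q)
rhs = s_marginal(p_x, q) + s_y_given_x(joint, q)
print(abs(lhs - rhs))  # chain-rule residual, expected at rounding-error level
```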

For two probability mass functions $p_X, p_Y \in \Delta^{n-1}$, the Tsallis relative entropy, generalizing the
KL divergence, is defined as

$$D_q(p_X \| p_Y) \triangleq -\sum_{x} p_X(x) \ln_q \frac{p_Y(x)}{p_X(x)}. \tag{8}$$

Finally, the Tsallis mutual entropy is defined as

$$I_q(X; Y) \triangleq S_q(X) - S_q(X|Y) = S_q(Y) - S_q(Y|X), \tag{9}$$

generalizing (for $q > 1$) Shannon's mutual information [Furuichi, 2006]. In Section 6, we establish
a relationship between Tsallis mutual entropy and a quantity called Jensen-Tsallis q-difference,
generalizing the one between mutual information and the JS divergence (shown, e.g., by Grosse
et al. [2002], and recalled below, in Subsection 4.2).

Furuichi [2006] also mentions an alternative generalization of Shannon's mutual information,
defined as

$$\tilde{I}_q(X; Y) \triangleq D_q(p_{X,Y} \| p_X \otimes p_Y), \tag{10}$$

where $p_{X,Y}$ is the true joint probability mass function of $(X, Y)$ and $p_X \otimes p_Y$ denotes their joint
probability if they were independent. This alternative definition of a "Tsallis mutual entropy" has
also been used by Lamberti and Majtey [2003]; notice that $\tilde{I}_q(X; Y) \ne I_q(X; Y)$ in general, the
case $q = 1$ being a notable exception. In Section 6, we show that this alternative definition also
leads to a nonextensive analogue of the JS divergence.
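The difference between the two definitions (9) and (10) is easy to see on a toy example. The sketch below uses an illustrative 2 x 2 joint distribution and helper names that are assumptions of this example; the mutual entropy (9) is computed as S_q(X) + S_q(Y) - S_q(X,Y), which follows from (9) and the chain rule.

```python
import math

def ln_q(x, q):
    """q-logarithm; reduces to the natural logarithm at q == 1."""
    if q == 1.0:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def tsallis_entropy(p, q):
    """Tsallis entropy (5) with k = 1."""
    return -sum(pi ** q * ln_q(pi, q) for pi in p if pi > 0.0)

def tsallis_relative_entropy(p, r, q):
    """Tsallis relative entropy (8); assumes r(x) > 0 wherever p(x) > 0."""
    return -sum(pi * ln_q(ri / pi, q) for pi, ri in zip(p, r) if pi > 0.0)

# Illustrative joint pmf (rows index x, columns index y); not from the paper.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]
flat_joint = [pxy for row in joint for pxy in row]
flat_indep = [px * py for px in p_x for py in p_y]  # same row-major order as flat_joint

def mutual_entropy(q):
    """I_q(X;Y) of (9), via S_q(X) + S_q(Y) - S_q(X,Y)."""
    return (tsallis_entropy(p_x, q) + tsallis_entropy(p_y, q)
            - tsallis_entropy(flat_joint, q))

def mutual_entropy_alt(q):
    """Alternative definition (10): D_q(p_XY || p_X (x) p_Y)."""
    return tsallis_relative_entropy(flat_joint, flat_indep, q)

for q in (1.0, 2.0):
    print(q, mutual_entropy(q), mutual_entropy_alt(q))
```

For this table the two quantities coincide at q = 1 (both are Shannon's mutual information) and differ at q = 2.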


3  Entropies  of  unnormalized  measures 

In this section, we consider functionals that extend the domain of the Shannon-Boltzmann-Gibbs
and Tsallis entropies to include unnormalized measures. Although, as shown below, these
functionals are completely characterized by their restriction to the normalized probability distributions,
the denormalization expressions will play an important role in Section 7 to derive novel positive
definite kernels inspired by mutual informations.

In order to keep generality, whenever possible we do not restrict to finite or countable sample
spaces. Instead, we consider a measured space $(\mathcal{X}, \mathcal{M}, \nu)$ where $\mathcal{X}$ is Hausdorff and $\nu$ is a σ-finite
Radon measure. We denote by $M_+(\mathcal{X})$ the set of finite Radon ν-absolutely continuous measures
on $\mathcal{X}$, and by $M_+^1(\mathcal{X})$ the subset of those which are probability measures. For simplicity, we often
identify each measure in $M_+(\mathcal{X})$ or $M_+^1(\mathcal{X})$ with its corresponding nonnegative density; this is
legitimated by the Radon-Nikodym theorem, which guarantees the existence and uniqueness (up
to equivalence within measure zero) of a density function $f : \mathcal{X} \to \mathbb{R}_+$. In the sequel,
Lebesgue-Stieltjes integrals of the form $\int_A f(x)\, d\nu(x)$ are often written as $\int_A f$, or simply $\int f$, if $A = \mathcal{X}$.
Unless otherwise stated, $\nu$ is the Lebesgue-Borel measure, if $\mathcal{X} \subseteq \mathbb{R}^n$ and $\operatorname{int} \mathcal{X} \ne \emptyset$, or the
counting measure, if $\mathcal{X}$ is countable. In the latter case integrals can be seen as finite sums or
infinite series.

3.1  Denormalization  of  the  Shannon-Boltzmann-Gibbs  Entropy  and  the  KL 
Divergence 

Define $\bar{\mathbb{R}} \triangleq \mathbb{R} \cup \{-\infty, +\infty\}$. For some functional $G : M_+(\mathcal{X}) \to \bar{\mathbb{R}}$, let the set
$M_+^G(\mathcal{X}) \triangleq \{ f \in M_+(\mathcal{X}) : |G(f)| < \infty \}$ be its effective domain, and
$M_+^{1,G}(\mathcal{X}) \triangleq M_+^G(\mathcal{X}) \cap M_+^1(\mathcal{X})$ be its subdomain of probability measures.

The following functional [Cuturi and Vert, 2005] extends the Shannon-Boltzmann-Gibbs
entropy from $M_+^{1,H}(\mathcal{X})$ to the unnormalized measures in $M_+^H(\mathcal{X})$:

$$H(f) = -k \int f \ln f = \int \varphi_H \circ f, \tag{11}$$

where $k > 0$ is a constant, the function $\varphi_H : \mathbb{R}_{++} \to \mathbb{R}$ is defined as

$$\varphi_H(y) = -k\, y \ln y, \tag{12}$$

and, as usual, $0 \ln 0 \triangleq 0$.

The generalized form of the KL divergence, often called generalized I-divergence [Csiszár,
1975], is a directed divergence between two measures $\mu_f, \mu_g \in M_+^H(\mathcal{X})$ such that $\mu_f$ is
$\mu_g$-absolutely continuous (denoted $\mu_f \ll \mu_g$). Let $f$ and $g$ be the densities associated with $\mu_f$ and $\mu_g$,
respectively. In terms of densities, this generalized KL divergence is

$$D(f, g) = k \int \Big( g - f + f \ln \frac{f}{g} \Big). \tag{13}$$

Both functionals $H$ and $D$ are completely determined by their restriction to the normalized
measures, as the next proposition shows.


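Proposition 1 The following equalities hold for any $c \in \mathbb{R}_{++}$ and $f, g \in M_+^H(\mathcal{X})$, with $\mu_f \ll \mu_g$:

$$H(cf) = c\,H(f) + |f|\,\varphi_H(c),$$

$$D(cf, cg) = c\,D(f, g),$$

$$D(cf, g) = c\,D(f, g) - |f|\,\varphi_H(c) + k(1 - c)|g|,$$

where $|f| \triangleq \int f = \mu_f(\mathcal{X})$. Consider $f \in M_+^H(\mathcal{X})$ and $g \in M_+^H(\mathcal{Y})$, and define
$f \otimes g \in M_+^H(\mathcal{X} \times \mathcal{Y})$ as $(f \otimes g)(x, y) \triangleq f(x) g(y)$. Then,

$$H(f \otimes g) = |g|\,H(f) + |f|\,H(g).$$

Naturally, if $|f| = |g| = 1$, we recover the additivity property of the Shannon-Boltzmann-Gibbs
entropy, $H(f \otimes g) = H(f) + H(g)$.

Proof: Straightforward from (11) and (13). ■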
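The identities of Proposition 1 can be verified numerically for the counting measure on a finite set, where the integrals in (11) and (13) become finite sums. In the sketch below the vectors f and g, the scalar c, the choice k = 1, and all function names are assumptions made for illustration.

```python
import math

K = 1.0  # the constant k of (11)-(13); the value 1 is an arbitrary illustrative choice

def phi_h(y):
    """phi_H(y) = -k y ln y as in (12), with the convention 0 ln 0 = 0."""
    return 0.0 if y == 0.0 else -K * y * math.log(y)

def entropy(f):
    """H(f) of (11) under the counting measure."""
    return sum(phi_h(v) for v in f)

def gen_kl(f, g):
    """Generalized KL divergence (13) under the counting measure (needs g > 0 where f > 0)."""
    return K * sum(gv - fv + (fv * math.log(fv / gv) if fv > 0.0 else 0.0)
                   for fv, gv in zip(f, g))

def mass(f):
    """Total mass |f|."""
    return sum(f)

def scale(c, f):
    return [c * v for v in f]

def tensor(f, g):
    """Product measure (f (x) g)(x, y) = f(x) g(y), flattened."""
    return [a * b for a in f for b in g]

# Unnormalized, strictly positive example measures (illustrative only).
f = [0.7, 1.2, 0.4]
g = [0.5, 2.0, 1.0]
c = 2.5

# Residuals of the four identities of Proposition 1; each should be near zero.
print(abs(entropy(scale(c, f)) - (c * entropy(f) + mass(f) * phi_h(c))))
print(abs(gen_kl(scale(c, f), scale(c, g)) - c * gen_kl(f, g)))
print(abs(gen_kl(scale(c, f), g)
          - (c * gen_kl(f, g) - mass(f) * phi_h(c) + K * (1.0 - c) * mass(g))))
print(abs(entropy(tensor(f, g)) - (mass(g) * entropy(f) + mass(f) * entropy(g))))
```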


3.2  Denormalization  of  Nonextensive  Entropies 

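Let us now proceed similarly with the nonextensive entropies. For $q \ge 0$, let
$M_+^{S_q}(\mathcal{X}) = \{ f \in M_+(\mathcal{X}) : f^q \in M_+(\mathcal{X}) \}$ for $q \ne 1$, and $M_+^{S_q}(\mathcal{X}) = M_+^H(\mathcal{X})$ for $q = 1$.
The nonextensive counterpart of (11), defined on $M_+^{S_q}(\mathcal{X})$, is

$$S_{q,\phi}(f) = \int \varphi_q \circ f, \tag{14}$$

where $\varphi_q : \mathbb{R}_{++} \to \mathbb{R}$ is given by

$$\varphi_q(y) = \begin{cases} \varphi_H(y) & \text{if } q = 1, \\[1mm] \dfrac{k}{\phi(q)} (y - y^q) & \text{if } q \ne 1, \end{cases} \tag{15}$$

and $\phi : \mathbb{R}_+ \to \mathbb{R}$ satisfies conditions (i)-(iii) stated following equation (2). The Tsallis entropy is
obtained for $\phi(q) = q - 1$,

$$S_q(f) = -k \int f^q \ln_q f. \tag{16}$$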

Similarly, a nonextensive generalization of the generalized KL divergence (13) is

$$D_{q,\phi}(f, g) = -\frac{k}{\phi(q)} \int \big( q f + (1 - q) g - f^q g^{1-q} \big), \tag{17}$$

for $q \ne 1$, and $D_{1,\phi}(f, g) \triangleq \lim_{q \to 1} D_{q,\phi}(f, g) = D(f, g)$.

For $|f| = |g| = 1$, several particular cases are recovered: if $\phi(q) = 1 - 2^{1-q}$, then $D_{q,\phi}(f, g)$
is the Havrda-Charvát or Daróczy relative entropy [Havrda and Charvát, 1967, Daróczy, 1970];
if $\phi(q) = q - 1$, then $D_{q,\phi}(f, g)$ is the Tsallis relative entropy (8); finally, if $\phi(q) = q(q - 1)$,
then $D_{q,\phi}(f, g)$ is the canonical α-divergence defined by Amari and Nagaoka [2001] in the realm
of information geometry (with the reparameterization $\alpha = 2q - 1$ and assuming $q > 0$ so that
$\phi(q) = q(q - 1)$ conforms with the axioms).

The  following  proposition  generalizes  Proposition  1  to  the  nonextensive  case. 


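Proposition 2 The following equalities hold for any $c \in \mathbb{R}_{++}$ and $f, g \in M_+^{S_q}(\mathcal{X})$, with $\mu_f \ll \mu_g$:

$$S_{q,\phi}(cf) = c^q S_{q,\phi}(f) + |f|\,\varphi_q(c), \tag{18}$$

$$D_{q,\phi}(cf, cg) = c\,D_{q,\phi}(f, g), \tag{19}$$

$$D_{q,\phi}(cf, g) = c^q D_{q,\phi}(f, g) - q\,\varphi_q(c)|f| + \frac{k}{\phi(q)} (q - 1)(1 - c^q)|g|. \tag{20}$$

For any $f \in M_+^{S_q}(\mathcal{X})$ and $g \in M_+^{S_q}(\mathcal{Y})$,

$$S_{q,\phi}(f \otimes g) = |g|\,S_{q,\phi}(f) + |f|\,S_{q,\phi}(g) - \frac{\phi(q)}{k} S_{q,\phi}(f) S_{q,\phi}(g). \tag{21}$$

If $|f| = |g| = 1$, we recover the pseudoadditivity property of nonextensive entropies:

$$S_{q,\phi}(f \otimes g) = S_{q,\phi}(f) + S_{q,\phi}(g) - \frac{\phi(q)}{k} S_{q,\phi}(f) S_{q,\phi}(g).$$

Proof: Straightforward from (14) and (17). ■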
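A numerical check of (18)-(21) in the Tsallis case $\phi(q) = q - 1$ can be sketched as follows, again for the counting measure on a finite set. The example vectors, the values of c and q, the choice k = 1, and the function names are illustrative assumptions; in this case the last term of (20) simplifies to $k(1 - c^q)|g|$.

```python
K = 1.0  # the constant k; 1 is an arbitrary illustrative choice

def phi_q(y, q):
    """phi_q(y) of (15) with phi(q) = q - 1 (Tsallis case), for q != 1."""
    return K * (y - y ** q) / (q - 1.0)

def entropy_q(f, q):
    """Nonextensive entropy (14) under the counting measure."""
    return sum(phi_q(v, q) for v in f)

def div_q(f, g, q):
    """Nonextensive divergence (17) with phi(q) = q - 1; requires f, g > 0 and q != 1."""
    total = sum(q * fv + (1.0 - q) * gv - fv ** q * gv ** (1.0 - q)
                for fv, gv in zip(f, g))
    return -K / (q - 1.0) * total

def mass(f):
    return sum(f)

def scale(c, f):
    return [c * v for v in f]

def tensor(f, g):
    return [a * b for a in f for b in g]

# Unnormalized, strictly positive example measures (illustrative only).
f = [0.7, 1.2, 0.4]
g = [0.5, 2.0, 1.0]
c, q = 2.5, 1.5

s_f, s_g = entropy_q(f, q), entropy_q(g, q)
d_fg = div_q(f, g, q)

# Residuals of (18), (19), (20) and (21); each should be near zero.
print(abs(entropy_q(scale(c, f), q) - (c ** q * s_f + mass(f) * phi_q(c, q))))
print(abs(div_q(scale(c, f), scale(c, g), q) - c * d_fg))
print(abs(div_q(scale(c, f), g, q)
          - (c ** q * d_fg - q * phi_q(c, q) * mass(f) + K * (1.0 - c ** q) * mass(g))))
print(abs(entropy_q(tensor(f, g), q)
          - (mass(g) * s_f + mass(f) * s_g - (q - 1.0) / K * s_f * s_g)))
```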

For $\phi(q) = q - 1$, $D_{q,\phi}$ is the Tsallis relative entropy and (20) reduces to

$$D_q(cf, g) = c^q D_q(f, g) - q\,\varphi_q(c)|f| + k(1 - c^q)|g|. \tag{22}$$

Naturally, all the equalities in Proposition 1 are obtained by taking the limit $q \to 1$ in those of
Proposition 2.


4  Jensen Differences and Divergences

4.1  The  Jensen  Difference 

Jensen's inequality [Jensen, 1906] is at the heart of many important results in information theory.
Let $E[\cdot]$ denote the expectation operator. Jensen's inequality states that if $Z$ is an integrable random
variable taking values in a set $\mathcal{Z}$, and $f$ is a measurable convex function defined on the convex hull
of $\mathcal{Z}$, then

$$f(E[Z]) \le E[f(Z)]. \tag{23}$$

Burbea and Rao [1982] considered the scenario where $\mathcal{Z}$ is finite, and took $f = -H_\varphi$, where
$H_\varphi : [a, b]^n \to \mathbb{R}$ is a concave function, called a φ-entropy, defined as

$$H_\varphi(z) \triangleq -\sum_{i=1}^{n} \varphi(z_i), \tag{24}$$

where $\varphi : [a, b] \to \mathbb{R}$ is convex. They studied the Jensen difference

$$J_\varphi^\pi(y_1, \dots, y_m) \triangleq H_\varphi\Big( \sum_{t=1}^{m} \pi_t y_t \Big) - \sum_{t=1}^{m} \pi_t H_\varphi(y_t), \tag{25}$$

where $\pi \triangleq (\pi_1, \dots, \pi_m) \in \Delta^{m-1}$ and each $y_1, \dots, y_m \in [a, b]^n$.

We consider here a more general scenario, involving two measured sets $(\mathcal{X}, \nu)$ and $(\mathcal{T}, \tau)$,
where the second is used to index the first.


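Definition 3 Let $\mu \triangleq (\mu_t)_{t \in \mathcal{T}}$ be a family of measures in $M_+(\mathcal{X})$ indexed by $\mathcal{T}$, and
let $\omega \in M_+(\mathcal{T})$ be a measure in $\mathcal{T}$. Define:

$$J_\Psi^\omega(\mu) \triangleq \Psi\Big( \int_{\mathcal{T}} \omega(t)\, \mu_t \, d\tau(t) \Big) - \int_{\mathcal{T}} \omega(t)\, \Psi(\mu_t) \, d\tau(t), \tag{26}$$

where:

(i) $\Psi$ is a concave functional such that $\operatorname{dom} \Psi \subseteq M_+(\mathcal{X})$;

(ii) $\omega(t)\mu_t(x)$ is τ-integrable, for all $x \in \mathcal{X}$;

(iii) $\int_{\mathcal{T}} \omega(t)\, \mu_t \, d\tau(t) \in \operatorname{dom} \Psi$;

(iv) $\mu_t \in \operatorname{dom} \Psi$, for all $t \in \mathcal{T}$;

(v) $\omega(t)\Psi(\mu_t)$ is τ-integrable.

If $\omega \in M_+^1(\mathcal{T})$, we still call (26) a Jensen difference.

In the following subsections, we consider several instances of Definition 3, leading to several
Jensen-type divergences.

4.2  The Jensen-Shannon Divergence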

Let $p$ be a random probability distribution taking values in $\{p_t\}_{t \in \mathcal{T}}$ according to a distribution
$\pi \in M_+^1(\mathcal{T})$. (In classification/estimation theory parlance, $\pi$ is called the prior distribution and
$p_t \triangleq p(\cdot|t)$ the likelihood function.) Then, (26) becomes

$$J_\Psi^\pi(p) = \Psi(E[p]) - E[\Psi(p)], \tag{27}$$

where the expectations are with respect to $\pi$.

Let now $\Psi = H$, the Shannon-Boltzmann-Gibbs entropy. Consider the random variables $T$ and
$X$, taking values respectively in $\mathcal{T}$ and $\mathcal{X}$, with densities $\pi(t)$ and $p(x) \triangleq \int_{\mathcal{T}} p(x|t)\pi(t)$. Using
standard notation of information theory [Cover and Thomas, 1991],

$$\begin{aligned} J^\pi(p) \triangleq J_H^\pi(p) &= H\Big( \int_{\mathcal{T}} \pi(t)\, p_t \Big) - \int_{\mathcal{T}} \pi(t) H(p_t) \\ &= H(X) - \int_{\mathcal{T}} \pi(t) H(X | T = t) \\ &= H(X) - H(X|T) \\ &= I(X; T), \end{aligned} \tag{28}$$

where $I(X; T)$ is the mutual information between $X$ and $T$. (This relationship between JS
divergence and mutual information was pointed out by Grosse et al. [2002].) Since $I(X; T)$ is also
equal to the KL divergence between the joint distribution and the product of the marginals [Cover
and Thomas, 1991], we have

$$J^\pi(p) = H(E[p]) - E[H(p)] = E[D(p \| E[p])]. \tag{29}$$

When $\mathcal{X}$ and $\mathcal{T}$ are finite with $|\mathcal{T}| = m$, $J_H^\pi(p_1, \dots, p_m)$ is called the Jensen-Shannon (JS)
divergence of $p_1, \dots, p_m$, with weights $\pi_1, \dots, \pi_m$ [Burbea and Rao, 1982, Lin, 1991]. Equality
(29) allows two interpretations of the JS divergence:

• the Jensen difference of the Shannon entropy of $p$;

• the expected KL divergence from $p$ to the expectation of $p$.

A remarkable fact is that $J^\pi(p) = \min_r E[D(p \| r)]$, i.e., $r^* = E[p]$ is a minimizer of $E[D(p \| r)]$
with respect to $r$. It has been shown that this property together with equality (29) characterize the
so-called Bregman divergences: they hold not only for $\Psi = H$, but for any concave $\Psi$ and the
corresponding Bregman divergence, in which case $J_\Psi^\pi$ is the Bregman information [Banerjee et al.,
2005].

When $|\mathcal{T}| = 2$ and $\pi = (1/2, 1/2)$, $p$ may be seen as a random distribution whose value on
$\{p_1, p_2\}$ is chosen by tossing a fair coin. In this case, $J^{(1/2,1/2)}(p) = JS(p_1, p_2)$, where

$$JS(p_1, p_2) \triangleq H\Big( \frac{p_1 + p_2}{2} \Big) - \frac{H(p_1) + H(p_2)}{2} = \frac{1}{2} D\Big( p_1 \Big\| \frac{p_1 + p_2}{2} \Big) + \frac{1}{2} D\Big( p_2 \Big\| \frac{p_1 + p_2}{2} \Big), \tag{30}$$

as introduced by Lin [1991]. It has been shown that $\sqrt{JS}$ satisfies the triangle inequality (hence
being a metric) and that, moreover, it is a Hilbertian metric¹ [Endres and Schindelin, 2003, Topsøe,
2000], which has motivated its use in kernel-based machine learning [Cuturi et al., 2005, Hein and
Bousquet, 2005] (see Section 7).
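The two interpretations of the JS divergence given by (29), and the minimizer property, can be illustrated numerically for finite distributions. The sketch below is a minimal example with k = 1; the distributions, weights, and function names are assumptions chosen for illustration.

```python
import math

def shannon(p):
    """Shannon entropy in nats, with 0 ln 0 = 0."""
    return -sum(v * math.log(v) for v in p if v > 0.0)

def kl(p, r):
    """KL divergence D(p || r); assumes r > 0 wherever p > 0."""
    return sum(v * math.log(v / w) for v, w in zip(p, r) if v > 0.0)

def mixture(weights, dists):
    """Expected distribution E[p] = sum_t pi_t p_t."""
    return [sum(w * d[i] for w, d in zip(weights, dists)) for i in range(len(dists[0]))]

def js_entropy_form(weights, dists):
    """JS divergence as a Jensen difference: H(E[p]) - E[H(p)]."""
    m = mixture(weights, dists)
    return shannon(m) - sum(w * shannon(d) for w, d in zip(weights, dists))

def js_kl_form(weights, dists):
    """JS divergence as the expected KL divergence to the mixture: E[D(p || E[p])]."""
    m = mixture(weights, dists)
    return sum(w * kl(d, m) for w, d in zip(weights, dists))

def expected_kl_to(weights, dists, r):
    """E[D(p || r)] for an arbitrary reference distribution r."""
    return sum(w * kl(d, r) for w, d in zip(weights, dists))

# Illustrative distributions p_1, p_2, p_3 and weights pi (not from the paper).
dists = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.3, 0.4, 0.3]]
weights = [0.5, 0.3, 0.2]

js_a = js_entropy_form(weights, dists)
js_b = js_kl_form(weights, dists)
other = expected_kl_to(weights, dists, [1.0 / 3.0] * 3)

print(js_a, js_b)   # the two forms in (29) should agree
print(other)        # any other reference r should give a value no smaller than the JS divergence
```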

4.3  The  Jensen-Renyi  Divergence 

Consider again the scenario above (Subsection 4.2), with the Rényi q-entropy

$$R_q(p) \triangleq \frac{1}{1 - q} \ln \int p^q \tag{31}$$

replacing the Shannon-Boltzmann-Gibbs entropy. It is worth noting that the Rényi and Tsallis
q-entropies are monotonically related through

$$R_q(p) = \ln\Big( \big[ 1 + (1 - q) S_q(p) \big]^{\frac{1}{1-q}} \Big), \tag{32}$$

or, using the q-logarithm function,

$$S_q(p) = \ln_q \exp R_q(p). \tag{33}$$

The Rényi q-entropy is concave for $q \in [0, 1)$ and has the Shannon-Boltzmann-Gibbs entropy
as the limit when $q \to 1$. Letting $\Psi = R_q$, (27) becomes

$$J_{R_q}^\pi(p) = R_q(E[p]) - E[R_q(p)]. \tag{34}$$

Unlike in the JS divergence case, there is no counterpart of equality (29) based on the Rényi
q-divergence

$$D_{R_q}(p_1 \| p_2) = \frac{1}{q - 1} \ln \int p_1^q\, p_2^{1-q}. \tag{35}$$

When $\mathcal{X}$ and $\mathcal{T}$ are finite, we call $J_{R_q}^\pi$ in (34) the Jensen-Rényi (JR) divergence. Furthermore,
when $|\mathcal{T}| = 2$ and $\pi = (1/2, 1/2)$, we write $J_{R_q}^\pi(p) = JR_q(p_1, p_2)$, where

$$JR_q(p_1, p_2) = R_q\Big( \frac{p_1 + p_2}{2} \Big) - \frac{R_q(p_1) + R_q(p_2)}{2}. \tag{36}$$

The JR divergence has been used in several signal/image processing applications, such as
registration, segmentation, denoising, and classification [Ben-Hamza and Krim, 2003, He et al., 2003,
Karakos et al., 2007]. In Section 7, we show that the JR divergence is (like the JS divergence) a
Hilbertian metric, which is relevant for its use in kernel-based machine learning.
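The relations (32) and (33) between the Rényi and Tsallis entropies, and the definition (36) of the JR divergence, can be checked on finite distributions with a few lines of code. The sketch below takes k = 1; the example distributions, the value q = 0.5, and the function names are illustrative assumptions.

```python
import math

def ln_q(x, q):
    """q-logarithm; reduces to the natural logarithm at q == 1."""
    if q == 1.0:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def renyi(p, q):
    """Renyi q-entropy (31) for a finite pmf, q != 1."""
    return math.log(sum(v ** q for v in p if v > 0.0)) / (1.0 - q)

def tsallis(p, q):
    """Tsallis q-entropy (4) with k = 1, q != 1."""
    return (1.0 - sum(v ** q for v in p if v > 0.0)) / (q - 1.0)

def jr(p1, p2, q):
    """Jensen-Renyi divergence (36) with equal weights."""
    mid = [(a + b) / 2.0 for a, b in zip(p1, p2)]
    return renyi(mid, q) - (renyi(p1, q) + renyi(p2, q)) / 2.0

# Illustrative distributions; q is taken in [0, 1), where R_q is concave.
p1 = [0.7, 0.2, 0.1]
p2 = [0.1, 0.3, 0.6]
q = 0.5

r, s = renyi(p1, q), tsallis(p1, q)
print(abs(r - math.log(1.0 + (1.0 - q) * s) / (1.0 - q)))  # residual of (32)
print(abs(s - ln_q(math.exp(r), q)))                       # residual of (33)
print(jr(p1, p2, q))   # nonnegative here, by concavity of R_q on [0, 1)
```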

¹A metric $d : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is Hilbertian if there is some Hilbert space $\mathcal{H}$ and an isometry $f : \mathcal{X} \to \mathcal{H}$ such that
$d^2(x, y) = \langle f(x) - f(y), f(x) - f(y) \rangle_{\mathcal{H}}$ holds for any $x, y \in \mathcal{X}$ [Hein and Bousquet, 2005].

4.4  The Jensen-Tsallis Divergence

Burbea and Rao [1982] have defined Jensen-type divergences of the form (27) based on the Tsallis
q-entropy $S_q$, defined in (16). Like the Shannon-Boltzmann-Gibbs entropy, but unlike the Rényi
entropies, the Tsallis q-entropy, for finite $\mathcal{T}$, is an instance of a φ-entropy (see (24)). Letting
$\Psi = S_q$, (27) becomes

$$J_{S_q}^\pi(p) = S_q(E[p]) - E[S_q(p)]. \tag{37}$$

Again, like in Subsection 4.3, if we consider the Tsallis q-divergence,

$$D_q(p_1 \| p_2) = \frac{1}{1 - q} \Big( 1 - \int p_1^q\, p_2^{1-q} \Big), \tag{38}$$

there is no counterpart of the equality (29).

When $\mathcal{X}$ and $\mathcal{T}$ are finite, $J_{S_q}^\pi$ in (37) is called the Jensen-Tsallis (JT) divergence and it has
also been applied in image processing [Ben-Hamza, 2006]. Unlike the JS divergence, the JT
divergence lacks an interpretation as a mutual information. Despite this, for $q \in [1, 2]$, it exhibits
joint convexity [Burbea and Rao, 1982]. In the next section, we propose an alternative to the JT
divergence which, amongst other features, is interpretable as a nonextensive mutual information
(in the sense of Furuichi [2006]) and is jointly convex, for $q \in [0, 1]$.


5 Convexity and q-Differences

5.1 Introduction

This section introduces a novel class of functions, termed Jensen q-differences, which generalize
Jensen differences. Later (in Section 6), we will use these functions to define the Jensen-Tsallis
q-difference, which we will propose as an alternative nonextensive generalization of the JS
divergence, instead of the JT divergence discussed in Subsection 4.4. We begin by recalling the concept
of $q$-expectation, used by Tsallis [1988] in nonextensive thermodynamics.


Definition 4 The unnormalized $q$-expectation of a random variable $X$, with probability density $p$,
is

$$E_q[X] \triangleq \int x\, p(x)^q. \qquad (39)$$


Of course, $q = 1$ corresponds to the standard notion of expectation. For $q \neq 1$, the
$q$-expectation does not match the intuitive meaning of average/expectation (e.g., $E_q[1] \neq 1$, in
general). The $q$-expectation is a convenient concept in nonextensive information theory; e.g., it
yields a very compact form for the Tsallis entropy: $S_q(X) = -E_q[\ln_q p(X)]$.
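
For a finite alphabet, the $q$-expectation and the compact form of the Tsallis entropy can be sketched as follows. This is an illustrative example under the assumption of the counting measure; the helper names `q_expectation` and `ln_q` are hypothetical, and zero-probability symbols are skipped by convention.

```python
import math

def ln_q(x, q):
    """q-logarithm; reduces to the natural logarithm at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def q_expectation(values, p, q):
    """Unnormalized q-expectation (39): sum_x values[x] * p[x]^q."""
    return sum(v * (px ** q) for v, px in zip(values, p) if px > 0)

def tsallis_entropy(p, q):
    """Tsallis entropy written as S_q(X) = -E_q[ln_q p(X)]."""
    values = [ln_q(px, q) if px > 0 else 0.0 for px in p]
    return -q_expectation(values, p, q)

p = [0.5, 0.5]
# E_q[1] equals sum_x p(x)^q, which differs from 1 unless q = 1.
print(q_expectation([1.0, 1.0], p, 2.0), q_expectation([1.0, 1.0], p, 1.0))
print(tsallis_entropy(p, 2.0), tsallis_entropy(p, 1.0))
```

At $q = 1$ the second function returns the Shannon entropy in nats, as expected from the limit of the $q$-logarithm.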


5.2 q-Convexity

We now introduce the novel concept of $q$-convexity and use it to derive a set of results, namely the
Jensen $q$-inequality.

Definition 5 Let $q \in \mathbb{R}$ and $\mathcal{X}$ be a convex set. A function $f : \mathcal{X} \to \mathbb{R}$ is $q$-convex if for any
$x, y \in \mathcal{X}$ and $\lambda \in [0, 1]$,

$$f(\lambda x + (1 - \lambda) y) \le \lambda^q f(x) + (1 - \lambda)^q f(y). \qquad (40)$$

If $-f$ is $q$-convex, $f$ is said to be $q$-concave.

Of course, 1-convexity is the usual notion of convexity. The next proposition states the Jensen
$q$-inequality.


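Proposition 6 If $f : \mathcal{X} \to \mathbb{R}$ is $q$-convex, then for any $n \in \mathbb{N}$, $x_1, \dots, x_n \in \mathcal{X}$ and $\pi =
(\pi_1, \dots, \pi_n) \in \Delta^{n-1}$,

$$f\left( \sum_{i=1}^{n} \pi_i x_i \right) \le \sum_{i=1}^{n} \pi_i^q f(x_i). \qquad (41)$$

Moreover, if $f$ is continuous, the above still holds for countably many points $(x_i)_{i \in \mathbb{N}}$.


Proof: In the finite case, the proof can be carried out by induction, exactly as in the
proof of the standard Jensen inequality [Cover and Thomas, 1991]. If $f$ is continuous, it commutes
with taking limits, thus

$$f\left( \sum_{i=1}^{\infty} \pi_i x_i \right) = f\left( \lim_{n \to \infty} \sum_{i=1}^{n} \pi_i x_i \right) = \lim_{n \to \infty} f\left( \sum_{i=1}^{n} \pi_i x_i \right) \le \lim_{n \to \infty} \sum_{i=1}^{n} \pi_i^q f(x_i) = \sum_{i=1}^{\infty} \pi_i^q f(x_i). \quad ■$$
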

Proposition 7 Let $f \ge 0$ and $q \ge r \ge 0$; then,

$$f \text{ is } q\text{-convex} \Rightarrow f \text{ is } r\text{-convex} \qquad (42)$$

$$f \text{ is } r\text{-concave} \Rightarrow f \text{ is } q\text{-concave}. \qquad (43)$$

Proof: Implication (42) results from

$$f(\lambda x + (1 - \lambda) y) \le \lambda^q f(x) + (1 - \lambda)^q f(y) \le \lambda^r f(x) + (1 - \lambda)^r f(y),$$

where the first inequality states the $q$-convexity of $f$ and the second one is valid because $f(x), f(y) \ge 0$
and $t^r \ge t^q \ge 0$, for any $t \in [0, 1]$ and $q \ge r$. The proof of (43) is similar. ■
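
5.3 Jensen q-Differences

We now generalize Jensen differences, formalized in Definition 3, by introducing the concept of
Jensen $q$-differences.


Definition 8 Let $\mu \triangleq (\mu_t)_{t \in \mathcal{T}} \in [M_+(\mathcal{X})]^{\mathcal{T}}$ be a family of measures in $M_+(\mathcal{X})$ indexed by $\mathcal{T}$, and
let $\omega \in M_+(\mathcal{T})$ be a measure in $\mathcal{T}$. For $q \ge 0$, define

$$T^{\omega}_{q,\Psi}(\mu) \triangleq \Psi\left( \int_{\mathcal{T}} \omega(t)\, \mu_t \, d\tau(t) \right) - \int_{\mathcal{T}} \omega(t)^q\, \Psi(\mu_t) \, d\tau(t), \qquad (44)$$

where:

(i) $\Psi$ is a concave functional such that $\mathrm{dom}\, \Psi \subseteq M_+(\mathcal{X})$;

(ii) $\omega(t)\, \mu_t(x)$ is $\tau$-integrable for all $x \in \mathcal{X}$;

(iii) $\int_{\mathcal{T}} \omega(t)\, \mu_t \, d\tau(t) \in \mathrm{dom}\, \Psi$;

(iv) $\mu_t \in \mathrm{dom}\, \Psi$, for all $t \in \mathcal{T}$;

(v) $\omega(t)^q\, \Psi(\mu_t)$ is $\tau$-integrable.

If $\omega \in M_+^1(\mathcal{T})$, we call the function defined in (44) a Jensen $q$-difference.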

Burbea and Rao [1982] established necessary and sufficient conditions on $\varphi$ for the Jensen
difference of a $\varphi$-entropy (see (24)) to be convex. The following proposition generalizes that
result, extending it to Jensen $q$-differences.

Proposition 9 Let $\mathcal{T}$ and $\mathcal{X}$ be finite sets, with $|\mathcal{T}| = m$ and $|\mathcal{X}| = n$, and let $\pi \in M_+^1(\mathcal{T})$. Let
$\varphi : [0, 1] \to \mathbb{R}$ be a function of class $C^2$ and consider the ($\varphi$-entropy [Burbea and Rao, 1982])
function $\Psi : [0, 1]^n \to \mathbb{R}$ defined as $\Psi(z) \triangleq -\sum_{i=1}^{n} \varphi(z_i)$. Then, the $q$-difference $T^{\pi}_{q,\Psi} : [0, 1]^{nm} \to \mathbb{R}$
is convex if and only if $\varphi$ is convex and $-1/\varphi''$ is $(2 - q)$-convex.

The proof is rather long, so it is relegated to Appendix A.


6 The Jensen-Tsallis q-Difference

6.1 Definition

As in Subsection 4.2, let $p$ be a random probability distribution taking values in $\{p_t\}_{t \in \mathcal{T}}$ according
to a distribution $\pi \in M_+^1(\mathcal{T})$. Then, we may write

$$T^{\pi}_{q,\Psi}(p) = \Psi(E[p]) - E_q[\Psi(p)], \qquad (45)$$

where the expectations are with respect to $\pi$. Hence Jensen $q$-differences may be seen as
deformations of the standard Jensen differences (27), in which the second expectation is replaced by a
$q$-expectation.

Let now $\Psi = S_q$, the nonextensive Tsallis $q$-entropy. Introducing the random variables $T$ and
$X$, with values respectively in $\mathcal{T}$ and $\mathcal{X}$, with densities $\pi(t)$ and $p(x) \triangleq \int_{\mathcal{T}} p(x|t)\, \pi(t)$, we have
(writing $T^{\pi}_{q,S_q}$ simply as $T^{\pi}_q$)

$$\begin{aligned}
T^{\pi}_q(p) &= S_q(E[p]) - E_q[S_q(p)] \\
&= S_q(X) - \int_{\mathcal{T}} \pi(t)^q\, S_q(X | T = t) \\
&= S_q(X) - S_q(X | T) \\
&= I_q(X; T), \qquad (46)
\end{aligned}$$

where $S_q(X|T)$ is the Tsallis conditional entropy (7), and $I_q(X; T)$ is the Tsallis mutual
information (9), as defined by Furuichi [2006]. Observe that (46) is a nonextensive analogue of (28).
Since, in general, $I_q \neq \tilde{I}_q$ (see (10)), unless $q = 1$ (in that case, $I_1 = \tilde{I}_1 = I$), there is no
counterpart of (29) in terms of $q$-differences. Nevertheless, Lamberti and Majtey [2003] have proposed a
non-logarithmic version of the JS divergence, which corresponds to using $\tilde{I}_q$ for the Tsallis mutual
$q$-entropy (although this interpretation is not explicitly mentioned by those authors).

When $\mathcal{X}$ and $\mathcal{T}$ are finite with $|\mathcal{T}| = m$, we call the quantity $T^{\pi}_q(p_1, \dots, p_m)$ the
Jensen-Tsallis (JT) $q$-difference of $p_1, \dots, p_m$ with weights $\pi_1, \dots, \pi_m$. Although the JT $q$-difference is
a generalization of the JS divergence, for $q \neq 1$, the term "divergence" would be misleading in
this case, since $T^{\pi}_q$ may take negative values (if $q < 1$) and does not vanish in general if $p$ is
deterministic.

When $|\mathcal{T}| = 2$ and $\pi = (1/2, 1/2)$, define $T_q \triangleq T^{1/2,1/2}_q$,

$$T_q(p_1, p_2) = S_q\left( \frac{p_1 + p_2}{2} \right) - \frac{S_q(p_1) + S_q(p_2)}{2^q}. \qquad (47)$$
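
The definition can be made concrete for finite distributions. The sketch below assumes the counting measure and strictly positive weights; `jt_q_difference` is an illustrative name. Evaluating it on two identical distributions shows the behaviour described above: the value is negative for $q < 1$, zero for $q = 1$, and positive for $q > 1$.

```python
import math

def tsallis_entropy(p, q):
    """Tsallis q-entropy of a finite distribution (Shannon entropy at q = 1)."""
    if abs(q - 1.0) < 1e-12:
        return -sum(x * math.log(x) for x in p if x > 0)
    return (1.0 - sum(x ** q for x in p if x > 0)) / (q - 1.0)

def jt_q_difference(dists, weights, q):
    """T_q^pi(p_1, ..., p_m) = S_q(sum_t pi_t p_t) - sum_t pi_t^q S_q(p_t).

    Assumes strictly positive weights that sum to one.
    """
    n = len(dists[0])
    mixture = [sum(w * d[j] for w, d in zip(weights, dists)) for j in range(n)]
    penalty = sum((w ** q) * tsallis_entropy(d, q) for w, d in zip(weights, dists))
    return tsallis_entropy(mixture, q) - penalty

p = [0.3, 0.7]
for q in (0.5, 1.0, 2.0):
    # Both arguments are the same distribution ("deterministic" random p).
    print(q, jt_q_difference([p, p], [0.5, 0.5], q))
```

With equal weights and identical arguments the value is $S_q(p)\,(1 - 2^{1-q})$, which makes the sign pattern explicit.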


Notable cases arise for particular values of $q$:


• For $q = 0$, $S_0(p) = -1 + \nu(\mathrm{supp}(p))$, where $\nu(\mathrm{supp}(p))$ denotes the measure of the support
of $p$ (recall that $p$ is defined on the measured space $(\mathcal{X}, \nu)$). For example, if $\mathcal{X}$ is finite
and $\nu$ is the counting measure, $\nu(\mathrm{supp}(p)) = \|p\|_0$ is the so-called 0-norm (although it is not
a norm) of vector $p$, i.e., its number of nonzero components. The Jensen-Tsallis 0-difference
is thus

$$\begin{aligned}
T_0(p_1, p_2) &= -1 + \nu\left( \mathrm{supp}\left( \frac{p_1 + p_2}{2} \right) \right) + 1 - \nu(\mathrm{supp}(p_1)) + 1 - \nu(\mathrm{supp}(p_2)) \\
&= 1 + \nu(\mathrm{supp}(p_1) \cup \mathrm{supp}(p_2)) - \nu(\mathrm{supp}(p_1)) - \nu(\mathrm{supp}(p_2)) \\
&= 1 - \nu(\mathrm{supp}(p_1) \cap \mathrm{supp}(p_2)); \qquad (48)
\end{aligned}$$

if $\mathcal{X}$ is finite and $\nu$ is the counting measure, this becomes

$$T_0(p_1, p_2) = 1 - \|p_1 \odot p_2\|_0, \qquad (49)$$

where $\odot$ denotes the Hadamard-Schur (i.e., elementwise) product. We call $T_0$ the Boolean
difference.

• For $q = 1$, since $S_1(p) = H(p)$, $T_1$ is the JS divergence,

$$T_1(p_1, p_2) = JS(p_1, p_2). \qquad (50)$$

• For $q = 2$, $S_2(p) = 1 - \langle p, p \rangle$, where $\langle a, b \rangle = \int_{\mathcal{X}} a(x)\, b(x)\, d\nu(x)$ is the inner product
between $a$ and $b$ (which reduces to $\langle a, b \rangle = \sum_i a_i b_i$ if $\mathcal{X}$ is finite and $\nu$ is the counting
measure). Consequently, the Tsallis 2-difference is

$$T_2(p_1, p_2) = \frac{1}{2} - \frac{1}{2} \langle p_1, p_2 \rangle, \qquad (51)$$

which we call the linear difference.
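
The three special cases can be verified on a small example. The following sketch assumes a finite set with the counting measure; the function names and the two test distributions are illustrative. Each printed pair should agree up to floating-point error.

```python
import math

def tsallis_entropy(p, q):
    """Tsallis q-entropy; zero entries are skipped, so q = 0 counts the support."""
    if abs(q - 1.0) < 1e-12:
        return -sum(x * math.log(x) for x in p if x > 0)
    return (1.0 - sum(x ** q for x in p if x > 0)) / (q - 1.0)

def jt_q(p1, p2, q):
    """Equal-weight JT q-difference (47)."""
    mid = [(a + b) / 2.0 for a, b in zip(p1, p2)]
    return tsallis_entropy(mid, q) - (tsallis_entropy(p1, q) + tsallis_entropy(p2, q)) / 2.0 ** q

def js_divergence(p1, p2):
    """JS divergence as the average KL divergence to the midpoint distribution."""
    mid = [(a + b) / 2.0 for a, b in zip(p1, p2)]
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    return 0.5 * kl(p1, mid) + 0.5 * kl(p2, mid)

p1 = [0.5, 0.5, 0.0]
p2 = [0.0, 0.4, 0.6]
shared_support = sum(1 for a, b in zip(p1, p2) if a > 0 and b > 0)
inner = sum(a * b for a, b in zip(p1, p2))

print(jt_q(p1, p2, 0.0), 1 - shared_support)      # Boolean difference (49)
print(jt_q(p1, p2, 1.0), js_divergence(p1, p2))   # JS divergence (50)
print(jt_q(p1, p2, 2.0), 0.5 - 0.5 * inner)       # linear difference (51)
```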


6.2 Properties of the JT q-difference

This subsection presents results regarding convexity and extrema of the JT $q$-difference, for several
values of $q$, extending known properties of the JS divergence ($q = 1$). Some properties of the JS
divergence are lost in the transition to nonextensivity; e.g., while the former is nonnegative and
vanishes if and only if all the distributions are identical, this is not true in general with the JT
$q$-difference. Nonnegativity of the JT $q$-difference is only guaranteed if $q \ge 1$, which explains why
some authors (e.g., Furuichi [2006]) only consider values of $q \ge 1$, when looking for nonextensive
analogues of Shannon's information theory. Moreover, unless $q = 1$, it is not generally true that
$T^{\pi}_q(p, \dots, p) = 0$ or even that $T^{\pi}_q(p, \dots, p, p') \ge T^{\pi}_q(p, \dots, p, p)$. For example, the solution of the
optimization problem

$$\min_{p_1 \in \Delta^{n-1}} T_q(p_1, p_2), \qquad (52)$$

is, in general, different from $p_2$, unless $q = 1$. Instead, this minimizer is closer to the uniform
distribution if $q \in [0, 1)$, and closer to a degenerate distribution, for $q \in (1, 2]$ (see Fig. 1). This
is not so surprising: recall that $T_2(p_1, p_2) = \frac{1}{2} - \frac{1}{2} \langle p_1, p_2 \rangle$; in this case, (52) becomes a linear
program, and the solution is not $p_2$, but $p_1^* = \delta_j$, where $j = \arg\max_i p_{2i}$.

We start by recalling a basic result, which essentially confirms that Tsallis entropies satisfy one
of the Suyari axioms (see Axiom A2 in Section 1), which states that entropies should be maximized
by uniform distributions.

Proposition 10 Let $\mathcal{X}$ be a finite set. The uniform distribution maximizes the Tsallis entropy for
any $q \ge 0$.

Proof: Consider the problem

$$\max_p S_q(p), \quad \text{subject to } \sum_i p_i = 1 \text{ and } p_i \ge 0.$$

Equating the gradient of the Lagrangian to zero yields

$$\frac{\partial}{\partial p_i} \left( S_q(p) + \lambda \left( \sum_i p_i - 1 \right) \right) = -q\, (q - 1)^{-1}\, p_i^{q-1} + \lambda = 0,$$

for all $i$. Since all these equations are identical, the solution is the uniform distribution, which is a
maximum, due to the concavity of $S_q$. ■

[Figure 1 plot: "Jensen-Tsallis q-Difference to a fixed Bernoulli"]

Figure 1: Jensen-Tsallis $q$-difference between two Bernoulli distributions, $p_1 = (0.3, 0.7)$ and
$p_2 = (p, 1 - p)$, for several values of the entropic index $q$. Observe that, for $q \in [0, 1)$, the
minimizer of the JT $q$-difference approaches the uniform distribution $(0.5, 0.5)$ as $q$ approaches 0;
for $q \in (1, 2]$, this minimizer approaches the degenerate distribution, as $q \to 2$.
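
The behaviour of the minimizer in the Bernoulli setting of Fig. 1 can be reproduced with a coarse grid search. This is only a sketch: the grid resolution and the function name `argmin_bernoulli` are arbitrary choices, and a grid search is used for simplicity rather than because the text prescribes it.

```python
import math

def tsallis_entropy(p, q):
    """Tsallis q-entropy of a finite distribution (Shannon entropy at q = 1)."""
    if abs(q - 1.0) < 1e-12:
        return -sum(x * math.log(x) for x in p if x > 0)
    return (1.0 - sum(x ** q for x in p if x > 0)) / (q - 1.0)

def jt_q(p1, p2, q):
    """Equal-weight JT q-difference (47)."""
    mid = [(a + b) / 2.0 for a, b in zip(p1, p2)]
    return tsallis_entropy(mid, q) - (tsallis_entropy(p1, q) + tsallis_entropy(p2, q)) / 2.0 ** q

def argmin_bernoulli(q, fixed=0.3, grid=1000):
    """Grid search for the p in [0, 1] minimizing T_q((fixed, 1 - fixed), (p, 1 - p))."""
    best_p, best_value = None, float("inf")
    for i in range(grid + 1):
        p = i / grid
        value = jt_q([fixed, 1.0 - fixed], [p, 1.0 - p], q)
        if value < best_value:
            best_p, best_value = p, value
    return best_p

for q in (0.5, 1.0, 2.0):
    print(q, argmin_bernoulli(q))
```

For $q = 1$ the search recovers the fixed parameter 0.3; for $q = 0.5$ the minimizer lies strictly between 0.3 and 0.5, and for $q = 2$ it sits at the boundary $p = 0$, the degenerate distribution concentrated on the more probable outcome of $p_1$.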


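The next corollary of Proposition 9 establishes the joint convexity of the JT $q$-difference, for
$q \in [0, 1]$. (Interestingly, this "complements" the joint convexity of the JT divergence (37), for
$q \in [1, 2]$, which was proved by Burbea and Rao [1982].)


Corollary 11 Let $\mathcal{T}$ and $\mathcal{X}$ be finite sets with cardinalities $m$ and $n$, respectively. For $q \in [0, 1]$, the
JT $q$-difference is a jointly convex function of its arguments $p_1, \dots, p_m \in M_+^1(\mathcal{X})$. Formally, let $\{p_t^{(i)}\}_{t \in \mathcal{T}}$, $i = 1, \dots, l$,
be a collection of $l$ sets of probability distributions on $\mathcal{X}$; then, for any $(\lambda_1, \dots, \lambda_l) \in \Delta^{l-1}$,

$$T^{\pi}_q\left( \sum_{i=1}^{l} \lambda_i\, p_1^{(i)}, \dots, \sum_{i=1}^{l} \lambda_i\, p_m^{(i)} \right) \le \sum_{i=1}^{l} \lambda_i\, T^{\pi}_q\left( p_1^{(i)}, \dots, p_m^{(i)} \right).$$
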

Proof: Observe that the Tsallis entropy (5) of a probability distribution $p_t = \{p_{t1}, \dots, p_{tn}\}$
can be written as

$$S_q(p_t) = -\sum_{i=1}^{n} \varphi_q(p_{ti}), \quad \text{where } \varphi_q(x) = \frac{x - x^q}{1 - q};$$

thus, from Proposition 9, $T^{\pi}_q$ is convex if and only if $\varphi_q$ is convex and $-1/\varphi_q''$ is $(2 - q)$-convex.

Since $\varphi_q''(x) = q\, x^{q-2}$, $\varphi_q$ is convex for $x \ge 0$ and $q \ge 0$. To show the $(2 - q)$-convexity
of $-1/\varphi_q''(x) = -(1/q)\, x^{2-q}$ for $x \ge 0$, and $q \in [0, 1]$, we use a version of the power mean
inequality [Steele, 2006],

$$-\left( \sum_{i=1}^{l} \lambda_i x_i \right)^{2-q} \le -\sum_{i=1}^{l} (\lambda_i x_i)^{2-q} = -\sum_{i=1}^{l} \lambda_i^{2-q}\, x_i^{2-q},$$

thus concluding that $-1/\varphi_q''$ is in fact $(2 - q)$-convex. ■


The next corollary, which results from the previous one, provides an upper bound for the JT
$q$-difference, for $q \in [0, 1]$. (Notice that this result is weaker than that of Proposition 13 below.)


Corollary 12 Let $\mathcal{X}$, $\mathcal{T}$ and $q$ be as in Corollary 11. Then, $T^{\pi}_q(p_1, \dots, p_m) \le S_q(\pi)$.

Proof: From Corollary 11, for $q \in [0, 1]$, $T^{\pi}_q(p_1, \dots, p_m)$ is convex. Since its domain is a
convex polytope (the Cartesian product of $m$ simplices), its maximum occurs on a vertex, i.e., when
each argument $p_t$ is a degenerate distribution at $x_t$, denoted $\delta_{x_t}$. In particular, if $|\mathcal{X}| \ge |\mathcal{T}|$, this
maximum occurs at the vertex corresponding to disjoint degenerate distributions, i.e., such that
$x_i \neq x_j$ if $i \neq j$. At this maximum,

$$\begin{aligned}
T^{\pi}_q(\delta_{x_1}, \dots, \delta_{x_m}) &= S_q\left( \sum_{t=1}^{m} \pi_t\, \delta_{x_t} \right) - \sum_{t=1}^{m} \pi_t^q\, S_q(\delta_{x_t}) \\
&= S_q\left( \sum_{t=1}^{m} \pi_t\, \delta_{x_t} \right) \qquad (53) \\
&= S_q(\pi) \qquad (54)
\end{aligned}$$

where the equality in (53) results from $S_q(\delta_{x_t}) = 0$. Notice that this maximum may not be achieved
if $|\mathcal{X}| < |\mathcal{T}|$. ■


The next proposition (proved in Appendix B) establishes (upper and lower) bounds for the JT
$q$-difference, extending Corollary 12 to any non-negative $q$ and to countable $\mathcal{X}$ and $\mathcal{T}$.

Proposition 13 Let $\mathcal{T}$ and $\mathcal{X}$ be countable sets. For $q \ge 0$,

$$T^{\pi}_q(p_1, \dots, p_m) \le S_q(\pi), \qquad (55)$$

and, if $|\mathcal{X}| \ge |\mathcal{T}|$, the maximum is reached for a set of disjoint degenerate distributions. As in
Corollary 12, this maximum may not be attained if $|\mathcal{X}| < |\mathcal{T}|$.

For $q \ge 1$,

$$T^{\pi}_q(p_1, \dots, p_m) \ge 0, \qquad (56)$$

and the minimum is attained in the purely deterministic case, i.e., when all distributions are equal
to the same degenerate distribution.

For $q \in [0, 1]$ and $\mathcal{X}$ a finite set with $|\mathcal{X}| = n$,

$$T^{\pi}_q(p_1, \dots, p_m) \ge S_q(\pi)\, [1 - n^{1-q}]. \qquad (57)$$

This lower bound (which is zero or negative) is attained when all distributions are uniform.
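
The bounds (55)-(57) lend themselves to a quick randomized sanity check on finite sets. The sketch below is not a proof and makes several assumptions: distributions are sampled by normalizing uniform random numbers, the sizes `n` and `m`, the seed, and the tolerance are arbitrary, and all helper names are illustrative.

```python
import math
import random

def tsallis_entropy(p, q):
    """Tsallis q-entropy of a finite distribution (Shannon entropy at q = 1)."""
    if abs(q - 1.0) < 1e-12:
        return -sum(x * math.log(x) for x in p if x > 0)
    return (1.0 - sum(x ** q for x in p if x > 0)) / (q - 1.0)

def jt_q_difference(dists, weights, q):
    """T_q^pi(p_1, ..., p_m) for strictly positive weights summing to one."""
    n = len(dists[0])
    mixture = [sum(w * d[j] for w, d in zip(weights, dists)) for j in range(n)]
    return tsallis_entropy(mixture, q) - sum(
        (w ** q) * tsallis_entropy(d, q) for w, d in zip(weights, dists))

def random_distribution(n, rng):
    raw = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(raw)
    return [x / total for x in raw]

def bounds_hold(q, n=4, m=3, trials=200, seed=0, tol=1e-9):
    """Check (55), and (56) or (57) where applicable, on random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        weights = random_distribution(m, rng)
        dists = [random_distribution(n, rng) for _ in range(m)]
        value = jt_q_difference(dists, weights, q)
        s_pi = tsallis_entropy(weights, q)
        if value > s_pi + tol:                                        # (55)
            return False
        if q >= 1.0 and value < -tol:                                 # (56)
            return False
        if q <= 1.0 and value < s_pi * (1.0 - n ** (1.0 - q)) - tol:  # (57)
            return False
    return True

print([bounds_hold(q) for q in (0.25, 0.5, 1.0, 1.5, 3.0)])
```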

Finally, the next proposition characterizes the convexity/concavity of the JT $q$-difference.


Proposition 14 Let $\mathcal{T}$ and $\mathcal{X}$ be countable sets. The JT $q$-difference is convex in each argument,
for $q \in [0, 2]$, and concave in each argument, for $q \ge 2$.

Proof: Notice that the JT $q$-difference can be written as $T^{\pi}_q(p_1, \dots, p_m) = \sum_j f(p_{1j}, \dots, p_{mj})$,
with

$$f(y_1, \dots, y_m) = \frac{1}{q - 1} \left[ \sum_i (\pi_i - \pi_i^q)\, y_i + \sum_i \pi_i^q\, y_i^q - \left( \sum_i \pi_i\, y_i \right)^q \right].$$

It suffices to consider the second derivative of $f$ with respect to $y_1$. Introducing $z = \sum_{i=2}^{m} \pi_i\, y_i$,

$$\frac{\partial^2 f}{\partial y_1^2} = q \left[ \pi_1^q\, y_1^{q-2} - \pi_1^2\, (\pi_1 y_1 + z)^{q-2} \right] = q\, \pi_1^2 \left[ (\pi_1 y_1)^{q-2} - (\pi_1 y_1 + z)^{q-2} \right]. \qquad (58)$$

Since $\pi_1 y_1 \le (\pi_1 y_1 + z) \le 1$, the quantity in (58) is nonnegative for $q \in [0, 2]$ and non-positive
for $q \ge 2$. ■


6.3 Joint and conditional JT q-differences and a chain rule

This subsection introduces joint and conditional JT $q$-differences, which will later be used as a
contrast measure between stochastic processes. A chain rule is derived that relates conditional and
joint JT $q$-differences.

Definition 15 Let $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{T}$ be measured sets. Let $(p_t)_{t \in \mathcal{T}} \in [M_+^1(\mathcal{X} \times \mathcal{Y})]^{\mathcal{T}}$ be a family of
measures in $M_+^1(\mathcal{X} \times \mathcal{Y})$ indexed by $\mathcal{T}$, and let $p$ be a random probability distribution taking
values in $\{p_t\}_{t \in \mathcal{T}}$ according to a distribution $\pi \in M_+^1(\mathcal{T})$. Consider also:

• for each $t \in \mathcal{T}$, the marginals $p_t(Y) \in M_+^1(\mathcal{Y})$,

• for each $t \in \mathcal{T}$ and $y \in \mathcal{Y}$, the conditionals $p_t(X | Y = y) \in M_+^1(\mathcal{X})$,

• the mixture $r(X, Y) \triangleq \int_{\mathcal{T}} \pi(t)\, p_t(X, Y) \in M_+^1(\mathcal{X} \times \mathcal{Y})$,

• the marginal $r(Y) \in M_+^1(\mathcal{Y})$,

• for each $y \in \mathcal{Y}$, the conditionals $r(X | Y = y) \in M_+^1(\mathcal{X})$.

For notational convenience, we also append a subscript to $p$ to emphasize its joint or conditional
dependency on the random variables $X$ and $Y$, i.e., $p_{XY} \triangleq p$, and $p_{X|Y}$ denotes a random
conditional probability distribution taking values in $\{p_t(\cdot | Y)\}_{t \in \mathcal{T}}$ according to the distribution $\pi$.

For $q \ge 0$, we define the joint JT $q$-difference of $p_{XY}$ as

$$T^{\pi}_q(p_{XY}) \triangleq T^{\pi}_q(p) = S_q(r) - E_{q, T \sim \pi(T)}[S_q(p_t)] \qquad (59)$$

and the conditional JT $q$-difference of $p_{X|Y}$ as

$$T^{\pi}_q(p_{X|Y}) \triangleq E_{q, Y \sim r(Y)}[S_q(r(\cdot | Y = y))] - E_{q, T \sim \pi(T)}\, E_{q, Y \sim p_t(Y)}[S_q(p_t(\cdot | Y = y))], \qquad (60)$$

where we appended the random variables being used in each $q$-expectation, for the sake of clarity.

Note that the joint JT $q$-difference is just the usual JT $q$-difference of the joint random variable
$X \times Y$, which equals (cf. (46))

$$T^{\pi}_q(p_{XY}) = S_q(X, Y) - S_q(X, Y | T) = I_q(X \times Y; T), \qquad (61)$$

and the conditional JT $q$-difference is nothing but the usual JT $q$-difference with all entropies
replaced by conditional entropies (conditioned on $Y$). Indeed, expression (60) can be rewritten as:

$$T^{\pi}_q(p_{X|Y}) = S_q(X | Y) - S_q(X | T, Y) = I_q(X; T | Y), \qquad (62)$$

i.e., the conditional JT $q$-difference may also be interpreted as a Tsallis mutual information, as in (46),
but now conditioned on the random variable $Y$.

Note also that, for $q = 1$ (the extensive case), (60) may also be rewritten in terms of the
conditional KL divergences,

$$\begin{aligned}
J^{\pi}(p_{X|Y}) \triangleq T^{\pi}_1(p_{X|Y}) &= E_{Y \sim r(Y)}[H(r(\cdot | Y = y))] - E_{T \sim \pi(T)}\, E_{Y \sim p_t(Y)}[H(p_t(\cdot | Y = y))] \\
&= E_{T \sim \pi(T)}\left[ E_{Y \sim p_t(Y)}\left[ D(p_t(\cdot | Y = y) \,\|\, r(\cdot | Y = y)) \right] \right]. \qquad (63)
\end{aligned}$$

Proposition 16 The following chain rule holds:

$$T^{\pi}_q(p_{XY}) = T^{\pi}_q(p_{X|Y}) + T^{\pi}_q(p_Y). \qquad (64)$$

Proof: Writing the joint/conditional JT $q$-differences as joint/conditional mutual informations
(61)-(62) and invoking the chain rule provided by (7), we have that

$$\begin{aligned}
I_q(X; T | Y) + I_q(Y; T) &= S_q(X | Y) - S_q(X | T, Y) + S_q(Y) - S_q(Y | T) \\
&= S_q(X, Y) - S_q(X, Y | T), \qquad (65)
\end{aligned}$$

which is the joint JT $q$-difference associated with the random variable $X \times Y$. ■
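
The chain rule (64) can be checked numerically when $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{T}$ are finite. The sketch below represents each joint distribution as a nested list `joint[x][y]`; that layout, the random test data, and every function name are assumptions made for illustration.

```python
import math
import random

def tsallis_entropy(p, q):
    """Tsallis q-entropy of a finite distribution (Shannon entropy at q = 1)."""
    if abs(q - 1.0) < 1e-12:
        return -sum(x * math.log(x) for x in p if x > 0)
    return (1.0 - sum(x ** q for x in p if x > 0)) / (q - 1.0)

def flatten(joint):
    return [v for row in joint for v in row]

def marginal_y(joint):
    """Marginal over Y of a joint table indexed as joint[x][y]."""
    return [sum(row[y] for row in joint) for y in range(len(joint[0]))]

def conditional_entropy(joint, q):
    """Tsallis conditional entropy S_q(X|Y) = sum_y p(y)^q S_q(p(.|y))."""
    total = 0.0
    for y, py in enumerate(marginal_y(joint)):
        if py > 0:
            total += (py ** q) * tsallis_entropy([row[y] / py for row in joint], q)
    return total

def mix(joints, weights):
    nx, ny = len(joints[0]), len(joints[0][0])
    return [[sum(w * j[x][y] for w, j in zip(weights, joints)) for y in range(ny)]
            for x in range(nx)]

def jt(functional, joints, weights, q):
    """Generic JT q-difference: F(mixture) - sum_t pi_t^q F(p_t)."""
    r = mix(joints, weights)
    return functional(r, q) - sum((w ** q) * functional(j, q)
                                  for w, j in zip(weights, joints))

def joint_jt(joints, weights, q):        # (59)
    return jt(lambda j, q_: tsallis_entropy(flatten(j), q_), joints, weights, q)

def conditional_jt(joints, weights, q):  # (60)
    return jt(conditional_entropy, joints, weights, q)

def marginal_jt(joints, weights, q):     # JT q-difference of the Y-marginals
    return jt(lambda j, q_: tsallis_entropy(marginal_y(j), q_), joints, weights, q)

def random_joint(nx, ny, rng):
    raw = [[rng.random() + 1e-3 for _ in range(ny)] for _ in range(nx)]
    total = sum(flatten(raw))
    return [[v / total for v in row] for row in raw]

rng = random.Random(1)
joints = [random_joint(3, 2, rng) for _ in range(2)]
weights = [0.4, 0.6]
for q in (0.5, 1.0, 2.0):
    lhs = joint_jt(joints, weights, q)
    rhs = conditional_jt(joints, weights, q) + marginal_jt(joints, weights, q)
    print(q, abs(lhs - rhs) < 1e-9)
```

The agreement follows from applying the Tsallis chain rule $S_q(X, Y) = S_q(Y) + S_q(X|Y)$ to the mixture and to each component separately.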

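Let us now turn our attention to the case where $Y = X^k$ for some $k \in \mathbb{N}$. In the following, the
notation $(A_n)_{n \in \mathbb{N}}$ denotes a stationary ergodic process with values on some finite alphabet $\mathcal{A}$.

Definition 17 Let $\mathcal{X}$ and $\mathcal{T}$ be measured sets, with $\mathcal{X}$ finite, and let $\mathcal{S} = [(X_n^{(t)})_{n \in \mathbb{N}}]_{t \in \mathcal{T}}$ be a family
of stochastic processes (taking values on the alphabet $\mathcal{X}$) indexed by $\mathcal{T}$. The $k$-th order JT
$q$-difference of $\mathcal{S}$ is defined, for $k = 1, \dots, n$, as

$$T^{\mathrm{joint},k}_{q,\pi}(\mathcal{S}) \triangleq T^{\pi}_q(p_{X^k}) \qquad (66)$$

and the $k$-th order conditional JT $q$-difference of $\mathcal{S}$ is defined, for $k = 1, \dots, n$, as

$$T^{\mathrm{cond},k}_{q,\pi}(\mathcal{S}) \triangleq T^{\pi}_q(p_{X|X^k}), \qquad (67)$$

and, for $k = 0$, as $T^{\mathrm{cond},0}_{q,\pi}(\mathcal{S}) \triangleq T^{\mathrm{joint},1}_{q,\pi}(\mathcal{S}) = T^{\pi}_q(p_X)$.

Proposition 18 The joint and conditional $k$-th order JT $q$-differences are related through:

$$T^{\mathrm{joint},k}_{q,\pi}(\mathcal{S}) = \sum_{i=0}^{k-1} T^{\mathrm{cond},i}_{q,\pi}(\mathcal{S}). \qquad (68)$$

Proof: Use Proposition 16 and induction. ■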

6.4 Asymptotic Analysis in the Extensive Case

We now focus on the extensive case ($q = 1$) for a brief asymptotic analysis of the $k$-th order
joint and conditional JT 1-differences (or conditional Jensen-Shannon divergences) when $k$ goes
to infinity.

The conditional Jensen-Shannon divergence was introduced by El-Yaniv et al. [1998] to address
the two-sample problem for strings emitted by Markovian sources. Given two strings $s$ and $t$, the
goal is to decide whether they were emitted by the same source or by different sources. Under
some fair assumptions, the most likely $k$-th order Markovian joint source of $s$ and $t$ is governed by
a distribution $\hat{r}$ given by

$$\hat{r} = \arg\min_r \; \lambda\, D(p_s \| r) + (1 - \lambda)\, D(p_t \| r), \qquad (69)$$

where $D(\cdot \| \cdot)$ are conditional KL divergences, $p_s$ and $p_t$ are the empirical $(k - 1)$-th order
conditionals associated with $s$ and $t$, respectively, and $\lambda = |s| / (|s| + |t|)$ is the length ratio. The solution
of the optimization problem is

$$\hat{r}(a | c) = \frac{\lambda\, p_s(c)}{\lambda\, p_s(c) + (1 - \lambda)\, p_t(c)}\, p_s(a | c) + \frac{(1 - \lambda)\, p_t(c)}{\lambda\, p_s(c) + (1 - \lambda)\, p_t(c)}\, p_t(a | c), \qquad (70)$$

where $a \in \mathcal{A}$ is a symbol and $c$ is a context (a string of preceding symbols over $\mathcal{A}$); this can be rewritten as $\hat{r}(a, c) = \lambda\, p_s(a, c) +
(1 - \lambda)\, p_t(a, c)$; i.e., the optimum in (69) is a mixture of $p_s$ and $p_t$ weighted by the string lengths.
Notice that, at the minimum, we have

$$\lambda\, D(p_s \| \hat{r}) + (1 - \lambda)\, D(p_t \| \hat{r}) = JS^{\mathrm{cond},k-1}_{\pi}(p_s, p_t). \qquad (71)$$


It is tempting to investigate the asymptotic behavior of the conditional and joint JS divergences,
when $k \to \infty$; however, unlike other asymptotic information theoretic quantities, like the entropy
rate or the cross entropy rate, this behavior fails to characterize the sources $s$ and $t$. Intuitively, this
is justified by the fact that observing more and more symbols drawn from the mixture of the two
sources rapidly decreases the uncertainty about which source generated the sample. Indeed, from
the asymptotic equipartition property of stationary ergodic sources [Cover and Thomas, 1991], we
have that $\lim_{k \to \infty} \frac{1}{k} H(p_{X^k}) = \lim_{k \to \infty} H(p_{X|X^k})$, which implies

$$\lim_{k \to \infty} JS^{\mathrm{cond},k}_{\pi} = \lim_{k \to \infty} \frac{1}{k}\, JS^{\mathrm{joint},k}_{\pi} \le \lim_{k \to \infty} \frac{1}{k}\, H(\pi) = 0, \qquad (72)$$

where we used the fact that the JS divergence is upper-bounded by the entropy of the mixture
$H(\pi)$ (see Proposition 13). Since the conditional JS divergence must be non-negative, we therefore
conclude that $\lim_{k \to \infty} JS^{\mathrm{cond},k}_{\pi} = 0$, pointwise.

7 Nonextensive mutual information kernels


7.1 Introduction

In this section we consider the application of extensive and nonextensive entropies to define
kernels on measures; since kernels involve pairs of measures, throughout this section $|\mathcal{T}| = 2$. Based
on the denormalization formulae presented in Section 3, we devise novel kernels related to the JS
divergence and the JT $q$-difference; these kernels allow setting a weight for each argument, and thus
will be called weighted Jensen-Tsallis kernels. We also introduce kernels related to the JR
divergence (Subsection 4.3) and the JT divergence (Subsection 4.4), and establish a connection between
the Tsallis kernels and a family of kernels investigated by Hein et al. [2004] and Fuglede [2005],
placing those kernels under a new information-theoretic light. After that, we give a brief overview
of string kernels, and using the results of Subsection 6.3, we devise $k$-th order Jensen-Tsallis
kernels between stochastic processes that subsume the well-known $p$-spectrum kernel of Leslie et al.
[2002]. Finally, we show that the parametrix approximation of the multinomial diffusion kernel,
proposed by Lafferty and Lebanon [2005], is not positive definite in general.

7.2 Positive and negative definite kernels

We start by recalling basic concepts from kernel theory [Scholkopf and Smola, 2002]; in the
following, $\mathcal{X}$ denotes a nonempty set.

Definition 19 Let $\varphi : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a symmetric function, i.e., a function satisfying $\varphi(y, x) =
\varphi(x, y)$, for all $x, y \in \mathcal{X}$. $\varphi$ is called a positive definite (pd) kernel if and only if

$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i\, c_j\, \varphi(x_i, x_j) \ge 0 \qquad (73)$$

for all $n \in \mathbb{N}$, $x_1, \dots, x_n \in \mathcal{X}$ and $c_1, \dots, c_n \in \mathbb{R}$.

Definition 20 Let $\psi : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be symmetric. $\psi$ is called a negative definite (nd) kernel if and
only if

$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i\, c_j\, \psi(x_i, x_j) \le 0 \qquad (74)$$

for all $n \in \mathbb{N}$, $x_1, \dots, x_n \in \mathcal{X}$ and $c_1, \dots, c_n \in \mathbb{R}$, satisfying the additional constraint $c_1 +
\dots + c_n = 0$. In this case, $-\psi$ is called conditionally pd; obviously, positive definiteness implies
conditional positive definiteness.

The sets of pd and nd kernels are both closed under pointwise sums/integrations, the former
being also closed under pointwise products; moreover, both sets are closed under pointwise
convergence. While pd kernels "correspond" to inner products via embedding in a Hilbert space, nd
kernels that vanish on the diagonal and are positive anywhere else "correspond" to squared
Hilbertian distances. These facts, and the following propositions and lemmas, are shown in Berg et al.
[1984].

Proposition 21 Let $\psi : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a symmetric function, and $x_0 \in \mathcal{X}$. Let $\varphi : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$
be given by

$$\varphi(x, y) = \psi(x, x_0) + \psi(y, x_0) - \psi(x, y) - \psi(x_0, x_0). \qquad (75)$$

Then, $\varphi$ is pd if and only if $\psi$ is nd.

Proposition 22 The function $\psi : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a nd kernel if and only if $\exp(-t \psi)$ is pd for all
$t > 0$.

Proposition 23 The function $\psi : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$ is a nd kernel if and only if $(t + \psi)^{-1}$ is pd for all
$t > 0$.

Lemma 24 If $\psi$ is nd and nonnegative on the diagonal, i.e., $\psi(x, x) \ge 0$ for all $x \in \mathcal{X}$, then so
are $\psi^{\alpha}$, for $\alpha \in [0, 1]$, and $\ln(1 + \psi)$.

Lemma 25 If $f : \mathcal{X} \to \mathbb{R}$ satisfies $f \ge 0$, then, for $\alpha \in [1, 2]$, the function $\psi_{\alpha}(x, y) = -(f(x) +
f(y))^{\alpha}$ is a nd kernel.

The following definition [Berg et al., 1984] has been used in a machine learning context by
Cuturi and Vert [2005].

Definition 26 Let $(\mathcal{X}, +)$ be a semigroup.² A function $\varphi : \mathcal{X} \to \mathbb{R}$ is called pd (in the semigroup
sense) if $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, defined as $k(x, y) = \varphi(x + y)$, is a pd kernel. Likewise, $\varphi$ is called nd
if $k$ is a nd kernel. Accordingly, these are called semigroup kernels.

7.3 Jensen-Shannon and Tsallis kernels

The basic result that allows deriving pd kernels based on the JS divergence and, more generally,
on the JT $q$-difference, is the fact that the denormalized Tsallis $q$-entropies (14) are nd functions
on $M_+^{S_q}(\mathcal{X})$, for $q \in [0, 2]$. Of course, this includes the denormalized Shannon-Boltzmann-Gibbs
entropy (11) as a particular case, corresponding to $q = 1$. Although part of the proof was given
by Berg et al. [1984] (and by Topsøe [2000] and Cuturi and Vert [2005] for the Shannon entropy
case), we present a complete proof here.


Proposition 27 For $q \in [0, 2]$, the denormalized Tsallis $q$-entropy $S_q$ is a nd function on $M_+^{S_q}(\mathcal{X})$.

Proof: Since nd kernels are closed under pointwise integration, it suffices to prove that $\varphi_q$
(see (15)) is nd on $\mathbb{R}_+$. For $q \neq 1$, $\varphi_q(y) = (q - 1)^{-1} (y - y^q)$. Let us consider two cases separately:
if $q \in [0, 1)$, $\varphi_q(y)$ equals a positive constant times $-\iota + \iota^q$, where $\iota(y) = y$ is the identity map
defined on $\mathbb{R}_+$. Since the set of nd functions is closed under sums, we only need to show that both
$-\iota$ and $\iota^q$ are nd. Both $\iota$ and $-\iota$ are nd, as can easily be seen from the definition; besides, since $\iota$ is
nd and nonnegative, Lemma 24 guarantees that $\iota^q$ is also nd. For the second case, where $q \in (1, 2]$,

²Recall that $(\mathcal{X}, +)$ is a semigroup if $+$ is a binary operation in $\mathcal{X}$ that is associative and has an identity element.

$\varphi_q(y)$ equals a positive constant times $\iota - \iota^q$. It only remains to show that $-\iota^q$ is nd for $q \in (1, 2]$:
Lemma 25 guarantees that the kernel $k(x, y) = -(x + y)^q$ is nd; therefore $-\iota^q$ is a nd function.
For $q = 1$, we use the fact that

$$\varphi_1(x) = \varphi_H(x) = -x \ln x = \lim_{q \to 1} \frac{x - x^q}{q - 1} = \lim_{q \to 1} \varphi_q(x),$$

where the limit is obtained by L'Hôpital's rule; since the set of nd functions is closed under limits,
$\varphi_1(x)$ is nd. ■


The following lemma [Berg et al., 1984] will also be needed below.


Lemma 28 The function $\zeta_q : \mathbb{R}_{++} \to \mathbb{R}$, defined as $\zeta_q(y) = y^{-q}$, is pd, for $q \in [0, 1]$.

Proof: We need to show that $k_q : \mathbb{R}_{++} \times \mathbb{R}_{++} \to \mathbb{R}$, defined as $k_q(x, y) = \zeta_q(x + y)$,
is pd, for $q \in [0, 1]$. The proof results from observing that

$$k_q(x, y) = (x + y)^{-q} = \lim_{t \to 0} \left[ t + (x + y)^q \right]^{-1}, \qquad (76)$$

which is always well defined because $x + y > 0$, combined with the following facts: from
Lemma 24, since $(x, y) \mapsto x + y$ is nd and nonnegative, $(x, y) \mapsto (x + y)^q$ is nd; from
Proposition 23, $(x, y) \mapsto [t + (x + y)^q]^{-1}$ is pd for any $t > 0$; the set of pd kernels is closed under limits. ■

We  are  now  in  a  position  to  present  the  main  contribution  of  this  section,  which  is  a  family  of 
weighted  Jensen-Tsallis  kernels,  generalizing  the  JS-based  (and  other)  kernels  in  two  ways: 

•  they  allow  using  unnormalized  measures;  equivalently,  they  allow  using  different  weights 
for  each  of  the  two  arguments; 

•  they  extend  the  mutual  information  feature  of  the  JS  kernel  to  the  nonextensive  scenario. 


Definition  29  (weighted  Jensen-Tsallis  kernels)  The  kernel  kq  :  M+‘'(T’)  x  R  is 

defined  as 

kqidl.pYj  =  kq{xilPl,Xl2P2) 

=  {SqY)  -T^{pi,P2))  {UJ1+UJ2Y, 

where  pi  =  pi/uii  and  p2  =  ^2/^2  <^re  the  normalized  counterparts  of  pi  and  fi2,  with  corre¬ 
sponding  masses  001,002  ^  cind  n  =  (a;i/(ci;i  oj2),oj2/ {ooi  -{-002))- 

The  kernel  kq  :  \  {0}^  — M  is  defined  as 

kq{pi,P2)  =  kq{0JiPi,0J2P2)  =  Sq^)  -  {pi,  P2)  ■ 
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Recalling  (46),  notice  that  Sg{n)  —  T^{pi,P2)  =  Sg(T)  —  Ig{X;T)  =  Sg(T\X)  can  be  inter¬ 
preted  as  the  Tsallis  posterior  conditional  entropy.  Hence,  kg  can  be  seen  (in  Bayesian  classifica¬ 
tion  terms)  as  a  nonextensive  expected  measure  of  uncertainty  in  correctly  identifying  the  class, 
given  the  prior  vr  =  (tti,  712),  and  a  random  sample  from  the  mixture  distribution  vripi  -f  'K2P2-  The 
more  similar  the  two  distributions  are,  the  greater  this  uncertainty. 

Proposition 30 The kernel $\tilde{k}_q$ is pd, for $q \in [0, 2]$. The kernel $k_q$ is pd, for $q \in [0, 1]$.

Proof: With $\mu_1 = \omega_1 p_1$ and $\mu_2 = \omega_2 p_2$ and using the denormalization formula of Proposition 2, we obtain $\tilde{k}_q(\mu_1, \mu_2) = -S_q(\mu_1 + \mu_2) + S_q(\mu_1) + S_q(\mu_2)$. Now invoke Proposition 21 with $\psi = S_q$ (which is nd by Proposition 27), $x = \mu_1$, $y = \mu_2$, and $x_0 = 0$ (the null measure). Observe now that $k_q(\mu_1, \mu_2) = \tilde{k}_q(\mu_1, \mu_2)(\omega_1 + \omega_2)^{-q}$. Since the product of two pd kernels is a pd kernel and (Proposition 28) $(\omega_1 + \omega_2)^{-q}$ is a pd kernel, for $q \in [0, 1]$, we conclude that $k_q$ is pd. ■
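The identity used at the start of this proof can be checked numerically. The following is a minimal sketch, assuming a finite set $X$ with the counting measure, so that a measure is just a list of nonnegative numbers; the function names are illustrative and not part of the paper. It computes the weighted Jensen-Tsallis kernel once from the definition, $(S_q(\pi) - T_q^{\pi}(p_1, p_2))(\omega_1 + \omega_2)^q$, and once from the denormalized form $S_q(\mu_1) + S_q(\mu_2) - S_q(\mu_1 + \mu_2)$.

```python
import math

def phi_q(y, q):
    """Pointwise term -y^q ln_q(y) of the extended Tsallis entropy (0 at y = 0)."""
    if y <= 0.0:
        return 0.0
    if q == 1:
        return -y * math.log(y)          # Shannon limit
    return (y - y ** q) / (q - 1.0)

def tsallis(mu, q):
    """Extended Tsallis entropy S_q of a finite nonnegative measure (a list)."""
    return sum(phi_q(y, q) for y in mu)

def jt_kernel_definition(mu1, mu2, q):
    """Weighted JT kernel from the definition: (S_q(pi) - T_q^pi(p1, p2)) (w1 + w2)^q."""
    w1, w2 = sum(mu1), sum(mu2)
    p1 = [y / w1 for y in mu1]
    p2 = [y / w2 for y in mu2]
    pi = [w1 / (w1 + w2), w2 / (w1 + w2)]
    mix = [pi[0] * a + pi[1] * b for a, b in zip(p1, p2)]
    # Jensen-Tsallis q-difference: entropy of the mixture minus q-weighted entropies
    t_q = tsallis(mix, q) - pi[0] ** q * tsallis(p1, q) - pi[1] ** q * tsallis(p2, q)
    return (tsallis(pi, q) - t_q) * (w1 + w2) ** q

def jt_kernel_denormalized(mu1, mu2, q):
    """Same kernel via the denormalized form S_q(mu1) + S_q(mu2) - S_q(mu1 + mu2)."""
    total = [a + b for a, b in zip(mu1, mu2)]
    return tsallis(mu1, q) + tsallis(mu2, q) - tsallis(total, q)
```

Under these assumptions the two functions agree up to rounding for any $q \in [0, 2]$; dividing either value by $(\omega_1 + \omega_2)^q$ gives the kernel without the mass factor.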

As we can see, the weighted Jensen-Tsallis kernels have two inherent properties: they are parameterized by the entropic index $q$ and they allow their arguments to be unbalanced, i.e., to have different weights $\omega_i$. We now mention some instances of kernels where each of these degrees of freedom is suppressed. We start with the following subfamily of kernels, obtained by setting $q = 1$.


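Definition 31 (weighted Jensen-Shannon kernels) The kernel $\tilde{k}_{WJS} : (M_+^{H}(X))^2 \to \mathbb{R}$ is defined as $\tilde{k}_{WJS} = \tilde{k}_1$, i.e.,

$$\tilde{k}_{WJS}(\mu_1, \mu_2) = \tilde{k}_{WJS}(\omega_1 p_1, \omega_2 p_2) = \left(H(\pi) - J^{\pi}(p_1, p_2)\right) (\omega_1 + \omega_2),$$

where $p_1 = \mu_1/\omega_1$ and $p_2 = \mu_2/\omega_2$ are the normalized counterparts of $\mu_1$ and $\mu_2$, and $\pi = (\omega_1/(\omega_1 + \omega_2), \omega_2/(\omega_1 + \omega_2))$.

Analogously, the kernel $k_{WJS} : (M_+^{H}(X) \setminus \{0\})^2 \to \mathbb{R}$ is simply $k_{WJS} = k_1$, i.e.,

$$k_{WJS}(\mu_1, \mu_2) = k_{WJS}(\omega_1 p_1, \omega_2 p_2) = H(\pi) - J^{\pi}(p_1, p_2).$$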


Corollary 32 The weighted Jensen-Shannon kernels $\tilde{k}_{WJS}$ and $k_{WJS}$ are pd.

Proof: Invoke Proposition 30 with $q = 1$. ■

The following family of weighted exponentiated JS kernels generalizes the so-called exponentiated JS kernel, which has been used, and shown to be pd, by Cuturi and Vert [2005].

Definition 33 (Exponentiated JS kernel) The kernel $k_{EJS} : M_+^{1}(X) \times M_+^{1}(X) \to \mathbb{R}$ is defined, for $t > 0$, as

$$k_{EJS}(p_1, p_2) = \exp\left[-t\, JS(p_1, p_2)\right]. \tag{77}$$

Definition 34 (Weighted exponentiated JS kernels) The kernel $k_{WEJS} : M_+^{H}(X) \times M_+^{H}(X) \to \mathbb{R}$ is defined, for $t > 0$, as

$$k_{WEJS}(\mu_1, \mu_2) = \exp\left[t\, k_{WJS}(\mu_1, \mu_2)\right] = \exp(t H(\pi)) \exp\left[-t J^{\pi}(p_1, p_2)\right]. \tag{78}$$

Corollary 35 The kernels $k_{WEJS}$ are pd. In particular, $k_{EJS}$ is pd.

Proof: Results from Proposition 22 and Corollary 32. Notice that although $k_{WEJS}$ is pd, neither of its two exponential factors in (78) is pd. ■

We now keep $q \in [0, 2]$ but consider the weighted JT kernel family restricted to normalized measures, $\tilde{k}_q|_{(M_+^{1}(X))^2}$. This corresponds to setting uniform weights ($\omega_1 = \omega_2 = 1/2$); note that in this case $\tilde{k}_q$ and $k_q$ collapse into the same kernel,

$$\tilde{k}_q(p_1, p_2) = k_q(p_1, p_2) = \ln_q(2) - T_q(p_1, p_2). \tag{79}$$

Proposition 30 guarantees that these kernels are pd for $q \in [0, 2]$. Remarkably, we recover three well-known particular cases for $q \in \{0, 1, 2\}$. We start with the Jensen-Shannon kernel, introduced and shown to be pd by Hein et al. [2004]; it is a particular case of a weighted Jensen-Shannon kernel in Definition 31.


Definition 36 (Jensen-Shannon kernel) The kernel $k_{JS} : M_+^{1}(X) \times M_+^{1}(X) \to \mathbb{R}$ is defined as

$$k_{JS}(p_1, p_2) = \ln 2 - JS(p_1, p_2).$$

Corollary 37 The kernel $k_{JS}$ is pd.

Proof: $k_{JS}$ is the restriction of $k_{WJS}$ to $M_+^{1}(X) \times M_+^{1}(X)$. ■

Finally, we study two other particular cases of the family of Tsallis kernels: the Boolean and linear kernels.

Definition 38 (Boolean kernel) Let the kernel $k_{Bool} : M_+^{S_0,1}(X) \times M_+^{S_0,1}(X) \to \mathbb{R}$ be defined as $k_{Bool} = k_0$, i.e.,

$$k_{Bool}(p_1, p_2) = \nu\left(\mathrm{supp}(p_1) \cap \mathrm{supp}(p_2)\right), \tag{80}$$

i.e., $k_{Bool}(p_1, p_2)$ equals the measure of the intersection of the supports (cf. the result (48)). In particular, if $X$ is finite and $\nu$ is the counting measure, the above may be written as

$$k_{Bool}(p_1, p_2) = \|p_1 \odot p_2\|_0. \tag{81}$$

Definition 39 (Linear kernel) Let the kernel $k_{lin} : M_+^{S_2,1}(X) \times M_+^{S_2,1}(X) \to \mathbb{R}$ be defined as

$$k_{lin}(p_1, p_2) = \frac{1}{2} \langle p_1, p_2 \rangle. \tag{82}$$

Corollary 40 The kernels $k_{Bool}$ and $k_{lin}$ are pd.

Proof: Invoke Proposition 30 with $q = 0$ and $q = 2$. Notice that, for $q = 2$, we just recover the well-known property of the inner product kernel [Schölkopf and Smola, 2002], which is equal to $k_{lin}$ up to a scalar. ■

In conclusion, the Boolean kernel, the Jensen-Shannon kernel, and the linear kernel are simply particular elements of the much wider family of Jensen-Tsallis kernels, continuously parameterized by $q \in [0, 2]$. Furthermore, the Jensen-Tsallis kernels are a particular subfamily of the even wider set of weighted Jensen-Tsallis kernels.
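The three special cases can be verified on a small example. The sketch below assumes a finite $X$ with the counting measure and uniform weights, and implements the unweighted kernel $\ln_q(2) - T_q(p_1, p_2)$ with $T_q(p_1, p_2) = S_q((p_1 + p_2)/2) - 2^{-q}(S_q(p_1) + S_q(p_2))$; the function names are illustrative only.

```python
import math

def ln_q(x, q):
    """Deformed logarithm ln_q(x) = (x^(1-q) - 1) / (1 - q), with ln_1 = ln."""
    if q == 1:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def tsallis(p, q):
    """Tsallis entropy of a probability vector (Shannon entropy in nats at q = 1)."""
    if q == 1:
        return -sum(y * math.log(y) for y in p if y > 0)
    # terms with y = 0 are skipped so that q = 0 counts the support size
    return (1.0 - sum(y ** q for y in p if y > 0)) / (q - 1.0)

def jt_kernel(p1, p2, q):
    """Unweighted Jensen-Tsallis kernel ln_q(2) - T_q(p1, p2)."""
    mix = [(a + b) / 2.0 for a, b in zip(p1, p2)]
    t_q = tsallis(mix, q) - 0.5 ** q * (tsallis(p1, q) + tsallis(p2, q))
    return ln_q(2.0, q) - t_q
```

For instance, with `p1 = [0.5, 0.5, 0.0]` and `p2 = [0.0, 0.25, 0.75]`, the supports share one element and the inner product is 0.125, so `jt_kernel(p1, p2, 0)` gives the Boolean kernel value 1 and `jt_kernel(p1, p2, 2)` gives the linear kernel value 0.0625, up to rounding.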

One  of  the  key  features  of  our  generalization  is  that  the  kernels  are  defined  on  unnormalized 
measures,  with  arbitrary  mass.  This  is  relevant,  for  example,  in  applications  of  kernels  on  empirical 
measures  (e.g.,  word  counts,  pixel  intensity  histograms);  instead  of  the  usual  step  of  normalization 
[Hein et al., 2004], we may leave these empirical measures unnormalized, thus allowing objects
of  different  size  (e.g.,  total  number  of  words  in  a  document,  total  number  of  image  pixels)  to  be 
weighted  differently.  Another  possibility  opened  by  our  generalization  is  the  explicit  inclusion  of 
weights:  given  two  normalized  measures,  they  can  be  multiplied  by  arbitrary  (positive)  weights 
before  being  fed  to  the  kernel  function. 

7.4 Other kernels based on Jensen differences and q-differences

It is worth noting that the Jensen-Rényi and the Jensen-Tsallis divergences also yield positive definite kernels, although there are no obvious "weighted generalizations" like the ones presented above for the Tsallis kernels.

Proposition 41 (Jensen-Rényi and Jensen-Tsallis kernels) For any $q \in [0, 2]$, the kernel

$$(p_1, p_2) \mapsto S_q\left(\frac{p_1 + p_2}{2}\right)$$

and the (unweighted) Jensen-Tsallis divergence $J_{S_q}$ (37) are nd kernels on $M_+^{1}(X) \times M_+^{1}(X)$. Also, for any $q \in [0, 1]$, the kernel

$$(p_1, p_2) \mapsto R_q\left(\frac{p_1 + p_2}{2}\right)$$

and the (unweighted) Jensen-Rényi divergence $J_{R_q}$ (34) are nd kernels on $M_+^{1}(X) \times M_+^{1}(X)$.

Proof: The fact that $(p_1, p_2) \mapsto S_q\left(\frac{p_1 + p_2}{2}\right)$ is nd results from the embedding $x \mapsto x/2$ and Proposition 27. Since $(p_1, p_2) \mapsto \frac{S_q(p_1) + S_q(p_2)}{2}$ is trivially nd, we have that $J_{S_q}$ is a sum of nd functions, which makes it nd. To prove the negative definiteness of the kernel $(p_1, p_2) \mapsto R_q\left(\frac{p_1 + p_2}{2}\right)$, notice first that the kernel $(x, y) \mapsto (x + y)/2$ is clearly nd. From Lemma 24 and integrating, we have that $(p_1, p_2) \mapsto \int \left(\frac{p_1 + p_2}{2}\right)^q$ is nd for $q \in [0, 1]$. From the same lemma we have that $(p_1, p_2) \mapsto \ln\left(t + \int \left(\frac{p_1 + p_2}{2}\right)^q\right)$ is nd for any $t > 0$. Since $\int \left(\frac{p_1 + p_2}{2}\right)^q > 0$, the negative definiteness of $(p_1, p_2) \mapsto R_q\left(\frac{p_1 + p_2}{2}\right)$ follows by taking the limit $t \to 0$. By the same argument as above, we conclude that $J_{R_q}$ is nd. ■

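As a consequence, we have from Lemma 22 that the following kernels are pd for any $t > 0$:

$$\tilde{k}_{EJR}(p_1, p_2) = \exp\left(-t R_q\left(\frac{p_1 + p_2}{2}\right)\right) = \left(\int \left(\frac{p_1 + p_2}{2}\right)^q\right)^{-\frac{t}{1-q}}, \tag{83}$$

and its "normalized" counterpart,

$$k_{EJR}(p_1, p_2) = \exp\left(-t J_{R_q}(p_1, p_2)\right) = \left(\frac{\int \left(\frac{p_1 + p_2}{2}\right)^q}{\sqrt{\int p_1^q \int p_2^q}}\right)^{-\frac{t}{1-q}}. \tag{84}$$

Although we could have derived the positive definiteness of the latter kernel without ever referring to the Rényi entropy, it has in fact a suggestive interpretation: it corresponds to an exponentiation of the Jensen-Rényi divergence, and it generalizes the case $q = 1$, which corresponds to the exponentiated Jensen-Shannon kernel.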
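As an illustration, the exponentiated Jensen-Rényi kernel can be written in two equivalent ways: as $\exp(-t J_{R_q}(p_1, p_2))$, and as a ratio of power sums raised to $-t/(1-q)$. The sketch below assumes a finite $X$ with the counting measure and $q \in [0, 1)$, with $R_q(p) = (1-q)^{-1} \ln \sum_i p_i^q$; the function names are illustrative.

```python
import math

def power_sum(p, q):
    """Sum of p_i^q over the support of p."""
    return sum(y ** q for y in p if y > 0)

def renyi(p, q):
    """Renyi entropy R_q(p) = ln(sum_i p_i^q) / (1 - q), for q != 1."""
    return math.log(power_sum(p, q)) / (1.0 - q)

def ejr_kernel(p1, p2, q, t):
    """exp(-t * Jensen-Renyi divergence), with uniform weights."""
    mix = [(a + b) / 2.0 for a, b in zip(p1, p2)]
    jr = renyi(mix, q) - 0.5 * (renyi(p1, q) + renyi(p2, q))
    return math.exp(-t * jr)

def ejr_kernel_ratio_form(p1, p2, q, t):
    """Same kernel written as a ratio of power sums raised to -t / (1 - q)."""
    mix = [(a + b) / 2.0 for a, b in zip(p1, p2)]
    ratio = power_sum(mix, q) / math.sqrt(power_sum(p1, q) * power_sum(p2, q))
    return ratio ** (-t / (1.0 - q))
```

Both functions return 1 when the two arguments coincide, since the Jensen-Rényi divergence of a distribution with itself is zero.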


Finally, we point out a relationship between the Jensen-Tsallis divergences (Subsection 4.4) and a family of difference kernels introduced by Fuglede [2005],

$$\psi_{\alpha,\beta}(x, y) = \left(\frac{x^{\alpha} + y^{\alpha}}{2}\right)^{1/\alpha} - \left(\frac{x^{\beta} + y^{\beta}}{2}\right)^{1/\beta}, \quad (x, y) \in \mathbb{R}_+^2. \tag{85}$$

Fuglede [2005] derived the negative definiteness of the above family of kernels provided $1 \le \alpha \le \infty$ and $1/2 \le \beta \le \alpha$; he went further by providing representations for these kernels. Hein et al. [2004] used the fact that the integral $\int \psi_{\alpha,\beta}(x(t), y(t))\, d\tau(t)$ is also nd to derive a family of pd kernels for probability measures that includes the Jensen-Shannon kernel.

We start by noting the following property of the extended Tsallis entropy, which is very easy to establish:

$$S_q(y) = q^{-1} S_{1/q}(y^q). \tag{86}$$

As a consequence, we have that

$$J_{S_q}(y_1, y_2) = S_q\left(\frac{y_1 + y_2}{2}\right) - \frac{S_q(y_1) + S_q(y_2)}{2} \tag{87}$$

$$= r \left[ S_r\left(\left(\frac{x_1^r + x_2^r}{2}\right)^{1/r}\right) - \frac{S_r(x_1) + S_r(x_2)}{2} \right] \tag{88}$$

$$= r \tilde{J}_{S_r}(x_1, x_2), \tag{89}$$

where we made the substitutions $r = q^{-1}$, $x_1 = y_1^q$ and $x_2 = y_2^q$, and introduced

$$\tilde{J}_{S_r}(x_1, x_2) = S_r\left(\left(\frac{x_1^r + x_2^r}{2}\right)^{1/r}\right) - \frac{S_r(x_1) + S_r(x_2)}{2} = (r - 1)^{-1} \left[ \left(\frac{x_1^r + x_2^r}{2}\right)^{1/r} - \frac{x_1 + x_2}{2} \right]. \tag{90}$$

Since $J_{S_q}$ is nd for $q \in [0, 2]$, we have that $\tilde{J}_{S_r}$ is nd for $r \in [1/2, \infty]$.

Notice that while $J_{S_q}$ may be interpreted as "the difference between the Tsallis q-entropy of the mean and the mean of the Tsallis q-entropies", $\tilde{J}_{S_q}$ may be interpreted as "the difference between the Tsallis q-entropy of the q-power mean and the mean of the Tsallis q-entropies".

From (90) we have that

$$\psi_{\alpha,\beta}(x, y) = (\alpha - 1) \tilde{J}_{S_\alpha}(x, y) - (\beta - 1) \tilde{J}_{S_\beta}(x, y), \tag{91}$$

so the family of probabilistic kernels studied in Hein et al. [2004] can be written in terms of Jensen-Tsallis divergences.
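The identities of this subsection can be checked pointwise. The sketch below works with positive scalars (the integrands of the corresponding measures) and assumes the pointwise extended Tsallis term $\varphi_q(y) = (y - y^q)/(q - 1)$ with $q \neq 1$; all function names are illustrative.

```python
def phi(y, q):
    """Scalar extended Tsallis term (y - y^q) / (q - 1), for y > 0 and q != 1."""
    return (y - y ** q) / (q - 1.0)

def power_mean(x1, x2, r):
    """r-power mean of two positive numbers."""
    return ((x1 ** r + x2 ** r) / 2.0) ** (1.0 / r)

def js_q(y1, y2, q):
    """Pointwise Jensen-Tsallis divergence: entropy of the mean minus mean entropy."""
    return phi((y1 + y2) / 2.0, q) - (phi(y1, q) + phi(y2, q)) / 2.0

def js_tilde(x1, x2, r):
    """Variant using the r-power mean in place of the arithmetic mean."""
    return phi(power_mean(x1, x2, r), r) - (phi(x1, r) + phi(x2, r)) / 2.0

def fuglede(x, y, alpha, beta):
    """Difference of two power means (Fuglede's family of kernels)."""
    return power_mean(x, y, alpha) - power_mean(x, y, beta)
```

With `q = 0.4` (so `r = 2.5`), `js_q(y1, y2, q)` matches `r * js_tilde(y1 ** q, y2 ** q, r)`, and `fuglede(x, y, a, b)` matches `(a - 1) * js_tilde(x, y, a) - (b - 1) * js_tilde(x, y, b)` up to rounding.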

7.5  k-th  order  Jensen-Tsallis  string  kernels 

This subsection introduces a new class of string kernels inspired by the k-th order JT q-difference introduced in Subsection 6.3. Although we refer to them as "string kernels," they are more generally kernels between stochastic processes.

Several string kernels (i.e., kernels operating on the space of strings) have been proposed in the literature [Haussler, 1999, Lodhi et al., 2002, Leslie et al., 2002, Vishwanathan and Smola, 2003, Shawe-Taylor and Cristianini, 2004]. These are kernels defined on $A^* \times A^*$, where $A^*$ is the Kleene closure of a finite alphabet $A$ (i.e., the set of all finite strings formed by characters in $A$ together with the empty string $\epsilon$). The p-spectrum kernel [Leslie et al., 2002] is associated with a feature space indexed by $A^p$ (the set of length-p strings). The feature representation of a string $s$, $\Phi^p(s) \triangleq (\phi_u^p(s))_{u \in A^p}$, counts the number of times each $u \in A^p$ occurs as a substring of $s$,

$$\phi_u^p(s) = \left|\{(v_1, v_2) : s = v_1 u v_2\}\right|. \tag{92}$$

The p-spectrum kernel is then defined as the standard inner product in $\mathbb{R}^{|A|^p}$,

$$k_{SK}^p(s, t) = \langle \Phi^p(s), \Phi^p(t) \rangle. \tag{93}$$

A more general kernel is the weighted all-substrings kernel [Vishwanathan and Smola, 2003], which takes into account the contribution of all the substrings weighted by their length. This kernel can be viewed as a conic combination of p-spectrum kernels and can be written as

$$k_{WASK}(s, t) = \sum_{p=1}^{\infty} \alpha_p\, k_{SK}^p(s, t), \tag{94}$$

where $\alpha_p$ is often chosen to decay exponentially with $p$ and truncated; for example, $\alpha_p = \lambda^p$ if $p_{min} \le p \le p_{max}$, and $\alpha_p = 0$ otherwise, where $0 < \lambda < 1$ is the decaying factor.

Both $k_{SK}^p$ and $k_{WASK}$ are trivially positive definite, the former by construction and the latter because it is a conic combination of positive definite kernels. A remarkable fact is that both kernels may be computed in $O(|s| + |t|)$ time (i.e., with cost that is linear in the length of the strings), as shown by Vishwanathan and Smola [2003], by using data structures such as suffix trees or suffix arrays [Gusfield, 1997]. Moreover, with $s$ fixed, any kernel $k(s, t)$ may be computed in time $O(|t|)$, which is particularly useful for classification applications.
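The two baseline string kernels can be sketched directly from their definitions. The code below is a naive illustration, not the linear-time suffix-tree algorithm of Vishwanathan and Smola: it counts p-grams with a hash map, and it assumes the exponentially decaying weights $\alpha_p = \lambda^p$ truncated to $p_{min} \le p \le p_{max}$. The function and parameter names are illustrative.

```python
from collections import Counter

def spectrum_features(s, p):
    """Counts of every length-p substring of s (the p-spectrum feature map)."""
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))

def spectrum_kernel(s, t, p):
    """p-spectrum kernel: inner product of the two p-gram count vectors."""
    fs, ft = spectrum_features(s, p), spectrum_features(t, p)
    return sum(count * ft[u] for u, count in fs.items())

def weighted_all_substrings_kernel(s, t, decay, p_min=1, p_max=None):
    """Conic combination of p-spectrum kernels with weights decay**p."""
    if p_max is None:
        # longer substrings cannot occur in both strings
        p_max = min(len(s), len(t))
    return sum(decay ** p * spectrum_kernel(s, t, p)
               for p in range(p_min, p_max + 1))
```

For example, `spectrum_kernel("abab", "bab", 2)` counts `ab` twice and `ba` once in the first string and once each in the second, giving 3.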

We will now see how Jensen-Tsallis kernels may be used as string kernels. In Subsection 6.3, we introduced the concept of joint and conditional JT q-differences. We have seen that joint JT q-differences are just JT q-differences in a product space of the form $X = X_1 \times X_2$; for k-th order joint JT q-differences this product space is of the form $A^k = A \times A^{k-1}$. Therefore, they still yield positive definite kernels as those introduced in Definition 29, where $X = A^k$. The next definition and proposition summarize these statements.

Definition 42 (k-th order weighted JT kernels) Let $\mathcal{S}(A)$ be the set of stationary and ergodic stochastic processes that take values on the alphabet $A$. For $k \in \mathbb{N}$ and $q \in [0, 2]$, let the kernel $\tilde{k}_{q,k} : (\mathbb{R}_+ \times \mathcal{S}(A))^2 \to \mathbb{R}$ be defined as

$$\tilde{k}_{q,k}((\omega_1, s_1), (\omega_2, s_2)) = \tilde{k}_q(\omega_1 p_{s_1,k}, \omega_2 p_{s_2,k}) \tag{95}$$

$$= \left(S_q(\pi) - T_{q,k}^{\mathrm{joint},\pi}(s_1, s_2)\right) (\omega_1 + \omega_2)^q,$$

where $p_{s_1,k}$ and $p_{s_2,k}$ are the k-th order joint probability functions associated with the stochastic sources $s_1$ and $s_2$, and $\pi = (\omega_1/(\omega_1 + \omega_2), \omega_2/(\omega_1 + \omega_2))$.

Let the kernel $k_{q,k} : ((\mathbb{R}_+ \setminus \{0\}) \times \mathcal{S}(A))^2 \to \mathbb{R}$ be defined as

$$k_{q,k}((\omega_1, s_1), (\omega_2, s_2)) = k_q(\omega_1 p_{s_1,k}, \omega_2 p_{s_2,k}). \tag{96}$$

Proposition 43 The kernel $\tilde{k}_{q,k}$ is pd, for $q \in [0, 2]$. The kernel $k_{q,k}$ is pd, for $q \in [0, 1]$.

Proof: Define the map $g : \mathbb{R}_+ \times \mathcal{S}(A) \to \mathbb{R}_+ \times M_+^{1,S_q}(A^k)$ as $(\omega, s) \mapsto g(\omega, s) = (\omega, p_{s,k})$. From Proposition 30, the kernel $\tilde{k}_q(g(\omega_1, s_1), g(\omega_2, s_2))$ is pd and therefore so is $\tilde{k}_{q,k}((\omega_1, s_1), (\omega_2, s_2))$; proceed analogously for $k_{q,k}$. ■

At this point, one might wonder whether the "k-th order conditional JT kernel" that would be obtained by replacing the joint JT q-difference $T_{q,k}^{\mathrm{joint},\pi}$ with the conditional one, $T_{q,k}^{\mathrm{cond},\pi}$, in (95)-(96) is also pd. Formula (68) shows that such a "conditional JT kernel" is a difference between two joint JT kernels, which is inconclusive. The following proposition shows that $\tilde{k}_{q,k}^{\mathrm{cond}}$ and $k_{q,k}^{\mathrm{cond}}$ are not pd in general. The proof, which is in Appendix C, proceeds by building a counterexample.

Proposition 44 Let $\tilde{k}_{q,k}^{\mathrm{cond}}$ be defined as $\tilde{k}_{q,k}^{\mathrm{cond}}(s_1, s_2) = \left(S_q(\pi) - T_{q,k}^{\mathrm{cond},\pi}(s_1, s_2)\right) (\omega_1 + \omega_2)^q$; and $k_{q,k}^{\mathrm{cond}}$ be defined as $k_{q,k}^{\mathrm{cond}}(s_1, s_2) = S_q(\pi) - T_{q,k}^{\mathrm{cond},\pi}(s_1, s_2)$. It holds that $\tilde{k}_{q,k}^{\mathrm{cond}}$ and $k_{q,k}^{\mathrm{cond}}$ are not pd in general.

Despite the negative result in Proposition 44, the chain rule in Proposition 18 still allows us to define pd kernels by combining conditional JT q-differences.

Proposition 45 Let $(\beta_k)_{k \in \mathbb{N}}$ be a non-increasing infinitesimal sequence, i.e., satisfying

$$\beta_0 \ge \beta_1 \ge \ldots \ge \beta_n \to 0. \tag{97}$$

Any kernel of the form

$$\sum_{k=0}^{\infty} \beta_k\, \tilde{k}_{q,k}^{\mathrm{cond}} \tag{98}$$

is pd for $q \in [0, 2]$; and any kernel of the form

$$\sum_{k=0}^{\infty} \beta_k\, k_{q,k}^{\mathrm{cond}} \tag{99}$$

is pd for $q \in [0, 1]$, provided both series above converge pointwise.

Proof: From the chain rule, we have that (defining the 0-th order joint JT kernel as $\tilde{k}_{q,0} = 0$)

$$\sum_{k=0}^{\infty} \beta_k\, \tilde{k}_{q,k}^{\mathrm{cond}} = \sum_{k=0}^{\infty} \beta_k \left(\tilde{k}_{q,k+1} - \tilde{k}_{q,k}\right) = \lim_{n \to \infty} \left[ \sum_{k=1}^{n} \alpha_k\, \tilde{k}_{q,k} + \beta_n\, \tilde{k}_{q,n+1} \right] = \sum_{k=1}^{\infty} \alpha_k\, \tilde{k}_{q,k} \tag{100}$$

with $\alpha_k = \beta_{k-1} - \beta_k$ (the term $\lim \beta_n \tilde{k}_{q,n+1}$ was dropped because $\beta_n \to 0$ and $\tilde{k}_{q,n+1}$ is bounded). Since $(\beta_k)_{k \in \mathbb{N}}$ is non-increasing, we have that $(\alpha_k)_{k \in \mathbb{N} \setminus \{0\}}$ is non-negative, which makes (100) the pointwise limit of a conic combination of pd kernels, and therefore a pd kernel. The proof for $\sum_{k=0}^{\infty} \beta_k k_{q,k}^{\mathrm{cond}}$ is analogous. ■

Notice that if we set $\beta_0 = \ldots = \beta_{k-1} = 1$ and $\beta_j = 0$, $\forall j \ge k$, in the above proposition, we recover the k-th order joint JT q-difference.

Finally, notice that, in the same way that the linear kernel is a special case of a JT kernel when $q = 2$ (see Cor. 40), the p-spectrum kernel (93) is a particular case of a p-th order joint JT kernel, and the weighted all substrings kernel (94) is a particular case of a combination of joint JT kernels in the form (98), both obtained when we set $q = 2$ and the weights $\omega_1$ and $\omega_2$ equal to the length of the strings. Therefore, we conclude that the JT string kernels introduced in this section subsume these two well-known string kernels.
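The relationship with the p-spectrum kernel can be illustrated on raw k-gram counts. The sketch below makes two assumptions that go beyond the text: each string is represented by its empirical k-gram count measure, so its weight is the number of k-grams $|s| - k + 1$ rather than exactly the string length, and the joint JT kernel is evaluated through the denormalized form $S_q(\mu_1) + S_q(\mu_2) - S_q(\mu_1 + \mu_2)$. Under these assumptions the $q = 2$ kernel equals twice the k-spectrum kernel, since $S_2(\mu_1) + S_2(\mu_2) - S_2(\mu_1 + \mu_2) = 2\langle \mu_1, \mu_2 \rangle$. The function names are illustrative.

```python
import math
from collections import Counter

def kgram_counts(s, k):
    """Empirical k-gram count measure of a string."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def tsallis_measure(counts, q):
    """Extended Tsallis entropy of an unnormalized count measure."""
    total = 0.0
    for y in counts.values():
        if y > 0:
            total += -y * math.log(y) if q == 1 else (y - y ** q) / (q - 1.0)
    return total

def joint_jt_string_kernel(s, t, k, q):
    """k-th order joint JT kernel on count measures, in denormalized form."""
    cs, ct = kgram_counts(s, k), kgram_counts(t, k)
    # Counter addition gives the count measure of the sum mu_1 + mu_2
    return tsallis_measure(cs, q) + tsallis_measure(ct, q) - tsallis_measure(cs + ct, q)

def spectrum_kernel(s, t, p):
    """p-spectrum kernel, for comparison."""
    cs, ct = kgram_counts(s, p), kgram_counts(t, p)
    return sum(count * ct[u] for u, count in cs.items())
```

Other values of $q$ in $[0, 2]$ give kernels that depend on the same counts but weigh rare and frequent k-grams differently.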

7.6  The  heat  kernel  approximation 

The diffusion kernel for statistical manifolds, recently proposed by Lafferty and Lebanon [2005], is grounded in information geometry [Amari and Nagaoka, 2001]. It models the diffusion of "information" over a statistical manifold according to the heat equation. Since in the case of the multinomial manifold (the relative interior of $\Delta^n$) the diffusion kernel has no closed form, the authors adopt the so-called "first-order parametrix expansion," which resembles the Gaussian kernel, replacing the Euclidean distance by the geodesic distance that is induced when the manifold is endowed with a Riemannian structure given by the Fisher information (we refer to Lafferty and Lebanon [2005] for further details). The resulting heat kernel approximation is

$$k_{heat}(p_1, p_2) = (4\pi t)^{-\frac{n}{2}} \exp\left(-\frac{1}{4t} d_g^2(p_1, p_2)\right), \tag{101}$$

where $t > 0$ and $d_g(p_1, p_2) = 2 \arccos\left(\sum_i \sqrt{p_{1i} p_{2i}}\right)$. Whether $k_{heat}$ is pd has been an open problem [Hein et al., 2004, Zhang et al., 2005]. Let $S_+^n$ be the positive orthant of the n-dimensional sphere, i.e.,

$$S_+^n = \left\{ (x_1, \ldots, x_{n+1}) \in \mathbb{R}^{n+1} \;\Big|\; \sum_{i=1}^{n+1} x_i^2 = 1, \ \forall i \ x_i \ge 0 \right\}.$$

The problem can be restated as follows: is there an isometric embedding from $S_+^n$ to some Hilbert space? In this section we answer that question in the negative.


Proposition 46 Let $n \ge 2$. For sufficiently large $t$, the kernel $k_{heat}$ is not pd.

Proof: From Proposition 22, $k_{heat}$ is pd, for all $t > 0$, if and only if $d_g^2$ is nd. We provide a counterexample, using the following four points in $\Delta^2$: $p_1 = (1, 0, 0)$, $p_2 = (0, 1, 0)$, $p_3 = (0, 0, 1)$ and $p_4 = (1/2, 1/2, 0)$. The squared distance matrix $[D_{ij}] = [d_g^2(p_i, p_j)]$ is

$$D = \frac{\pi^2}{4} \begin{bmatrix} 0 & 4 & 4 & 1 \\ 4 & 0 & 4 & 1 \\ 4 & 4 & 0 & 4 \\ 1 & 1 & 4 & 0 \end{bmatrix}. \tag{102}$$

Taking $c = (-4, -4, 1, 7)^T$ we have $c^T D c = 2\pi^2 > 0$, showing that $D$ is not nd. Although $p_1, p_2, p_3, p_4$ lie on the boundary of $\Delta^2$, continuity of $d_g^2$ implies that it is not nd on the relative interior of $\Delta^2$. The case $n > 2$ follows easily, by appending zeros to the four vectors above. ■
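The counterexample is easy to reproduce numerically. The sketch below assumes the geodesic distance $d_g(p_1, p_2) = 2 \arccos\left(\sum_i \sqrt{p_{1i} p_{2i}}\right)$ and evaluates the quadratic form $c^T D c$ for the four points and the coefficient vector of the proof; note that the entries of $c$ sum to zero, as negative definiteness only constrains such vectors. The function names are illustrative.

```python
import math

def geodesic_sq(p1, p2):
    """Squared geodesic distance (2 * arccos of the Bhattacharyya coefficient)^2."""
    inner = sum(math.sqrt(a * b) for a, b in zip(p1, p2))
    inner = min(1.0, max(-1.0, inner))   # guard against rounding outside [-1, 1]
    return (2.0 * math.acos(inner)) ** 2

def heat_kernel_counterexample():
    """Return (sum of c, c^T D c) for the four points used in the proof."""
    points = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.5, 0.5, 0.0)]
    c = (-4.0, -4.0, 1.0, 7.0)
    dist = [[geodesic_sq(p, r) for r in points] for p in points]
    quad = sum(c[i] * dist[i][j] * c[j] for i in range(4) for j in range(4))
    return sum(c), quad
```

A positive quadratic form for a zero-sum vector is exactly what rules out negative definiteness of the squared distance.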


8  Experiments 

We illustrate the performance of the proposed nonextensive information theoretic kernels, in comparison with common kernels, for SVM-based text classification. We performed experiments with two standard datasets: Reuters-21578 (available at www.daviddlewis.com/resources/testcollections) and WebKB (available at www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data). Since our objective was to evaluate the kernels, we considered a simple binary classification task that discriminates between the two largest categories of each dataset; this led us to the earn-vs-acq classification task for the first dataset, and stud-vs-fac (students' vs. faculty webpages) for the second dataset. Two different frameworks were considered: modeling documents as bags-of-words, and modeling them as strings of characters. Therefore, both bags-of-words kernels and string kernels were employed for each task.

8.1 Documents as bags-of-words

For the bags-of-words framework, after the usual preprocessing steps of stemming and stop-word removal, we mapped text documents into probability distributions over words using the bag-of-words model and maximum likelihood estimation; this corresponds to normalizing the term frequencies (tf) using the $\ell_1$-norm, and is referred to as tf [Joachims, 2002, Manning and Schütze, 1999]. We also used the tf-idf (term frequency-inverse document frequency) representation, which penalizes terms that occur in many documents [Joachims, 2002, Manning and Schütze, 1999]. To weight the documents for the Tsallis kernels, we tried four strategies: uniform weighting, word counts, square root of the word counts, and one plus the logarithm of the word counts; however, for both tasks, uniform weighting proved to be the best strategy, which may be due to the fact that documents in both collections are usually short and do not differ much in size.
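The tf representation and the four weighting strategies can be sketched as follows. This is a minimal illustration that assumes documents are already tokenized, stemmed and stripped of stop words; the function names and the strategy labels (`"uniform"`, `"count"`, `"sqrt"`, `"log"`) are hypothetical, and the natural logarithm is an assumption since the base is not stated.

```python
import math
from collections import Counter

def tf_distribution(tokens):
    """Maximum likelihood word distribution: term counts normalized in l1-norm."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def document_weight(n_words, strategy):
    """Mass assigned to a document with n_words tokens under each strategy."""
    if strategy == "uniform":
        return 1.0
    if strategy == "count":
        return float(n_words)
    if strategy == "sqrt":
        return math.sqrt(n_words)
    if strategy == "log":
        return 1.0 + math.log(n_words)   # natural log assumed
    raise ValueError("unknown weighting strategy: " + strategy)
```

Multiplying the tf distribution by the chosen weight gives the unnormalized measure that is fed to a weighted Jensen-Tsallis kernel.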

As baselines, we used the linear kernel with $\ell_2$ normalization, commonly used for this task [Joachims, 2002], and the heat kernel approximation (101) [Lafferty and Lebanon, 2005], which is known to outperform the former, although it is not guaranteed to be pd for an arbitrary choice of $t$, as shown above. This parameter and the SVM $C$ parameter were tuned by cross-validation over the training set. The SVM-Light package (available at http://svmlight.joachims.org/) was used to solve the SVM quadratic optimization problem.

Figs. 2-3 summarize the results. We report the performance of the Tsallis kernels as a function of the entropic index $q$. For comparison, we also plot the performance of an instance of a Tsallis kernel with $q$ tuned by cross-validation. For the first task, this kernel and the two baselines exhibit similar performance for both the tf and the tf-idf representations; differences are not statistically significant. In the second task, the Tsallis kernel outperformed the $\ell_2$-normalized linear kernel for both representations, and the heat kernel for tf-idf; the differences are statistically significant (using the unpaired t test at the 0.05 level). Regarding the influence of the entropic index, we observe that in both tasks, the optimum value of $q$ is usually higher for tf-idf than for tf.

The results on these two problems are representative of the typical relative performance of the kernels considered: in almost all tested cases, both the heat kernel and the Tsallis kernels (for a suitable value of $q$) outperform the $\ell_2$-normalized linear kernel; the Tsallis kernels are competitive with the heat kernel.

8.2  Documents  as  strings 

In the second set of experiments, each document is mapped into a probability distribution over character p-grams, using maximum likelihood estimation; we ran experiments for $p = 3, 4, 5$. To weight the documents for the p-th order joint Jensen-Tsallis kernels, four strategies were attempted: uniform weighting, document lengths (in characters), square root of the document lengths, and one plus the logarithm of the document lengths. For the earn-vs-acq task, all strategies performed similarly, with a slight advantage for the square root and logarithm of the document lengths; for the stud-vs-fac task, uniform weighting proved to be the best strategy. For simplicity, all experiments reported here use uniform weighting.

As baselines, we used the p-spectrum kernel (PSK, see (93)) for the values of $p$ referred above, and the weighted all substrings kernel (WASK, see (94)) with decaying factor tuned to $\lambda = 0.75$ (which yielded the best results), with $p_{min} = p$ set to the values above, and $p_{max} = \infty$. The SVM $C$ parameter was tuned by cross-validation over the training set.

Figure 2: Results for earn-vs-acq using tf and tf-idf representations. The error bars represent ±1 standard deviation on 30 runs. Training (resp. testing) with 200 (resp. 250) samples per class. (Two panels, tf and tf-idf; horizontal axis: entropic index q.)

Figs. 4-5 summarize the results. For the first task, the JT string kernel and the WASK outperformed the PSK (with statistical significance for $p = 3$), all kernels performed similarly for $p = 4$, and the JT string kernel outperformed the WASK for $p = 5$; all other differences are not statistically significant. In the second task, the JT string kernel outperformed both the WASK and the PSK (and the WASK outperformed the PSK), with statistical significance for $p = 3, 4, 5$. Furthermore, by comparing Fig. 3 and Fig. 5, we also observe that the 5-th order JT string kernel remarkably outperforms all bags-of-words kernels for the stud-vs-fac task, even though it does not use or build any sort of language model at the word level.


9  Conclusions 

In this paper we have introduced a new family of positive definite kernels between measures, which contains previous information-theoretic kernels on probability measures as particular cases. One of the key features of the new kernels is that they are defined on unnormalized measures (not necessarily normalized probabilities). This is relevant, e.g., for kernels on empirical measures (such as word counts, pixel intensity histograms); instead of the usual step of normalization [Hein et al., 2004], we may leave these empirical measures unnormalized, thus allowing objects of different size (e.g., documents of different lengths, images with different sizes) to be weighted differently. Another possibility is the explicit inclusion of weights: given two normalized measures, they can be multiplied by arbitrary (positive) weights before being fed to the kernel function. In addition, we define positive definite kernels between stochastic processes that subsume well-known string kernels.

Figure 3: Results for stud-vs-fac.

The new kernels, and the proofs of positive definiteness, rely on other main contributions of this paper: the new concept of q-convexity, for which we proved a Jensen q-inequality; the concept of Jensen-Tsallis q-difference, a nonextensive generalization of the Jensen-Shannon divergence; and denormalization formulae for several entropies and divergences.

We have reported experiments in which these new kernels were used in support vector machines for text classification tasks. Although the reported experiments do not allow drawing strong conclusions, they show that the new kernels are competitive with the state-of-the-art, in some cases yielding a significant performance improvement.


A  Proof  of  Proposition  9 

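Proof: The case $q = 1$ corresponds to the Jensen difference and was proved by Burbea and Rao [1982] (Theorem 1). Our proof extends that to $q \neq 1$. Let $y = (y_1, \ldots, y_m)$, where $y_t = (y_{t1}, \ldots, y_{tn})$. Thus

$$T_{q,\Psi}^{\pi}(y) = \sum_{i=1}^{n} \left[ \sum_{t=1}^{m} \pi_t^q \varphi(y_{ti}) - \varphi\left(\sum_{t=1}^{m} \pi_t y_{ti}\right) \right], \tag{103}$$

showing that it suffices to consider $n = 1$, where each $y_t \in [0, 1]$, i.e.,

$$T_{q,\varphi}^{\pi}(y_1, \ldots, y_m) = \sum_{t=1}^{m} \pi_t^q \varphi(y_t) - \varphi\left(\sum_{t=1}^{m} \pi_t y_t\right);$$

this function is convex on $[0, 1]^m$ if and only if, for every fixed $a_1, \ldots, a_m \in [0, 1]$ and $b_1, \ldots, b_m \in \mathbb{R}$, the function

$$f(x) = T_{q,\varphi}^{\pi}(a_1 + b_1 x, \ldots, a_m + b_m x) \tag{104}$$

is convex in $\{x \in \mathbb{R} : a_t + b_t x \in [0, 1],\ t = 1, \ldots, m\}$. Since $f$ is $C^2$, it is convex if and only if $f''(x) \ge 0$.

We first show that convexity of $f$ (equivalently of $T_{q,\varphi}^{\pi}$) implies convexity of $\varphi$. Letting $c_t = a_t + b_t x$,

$$f''(x) = \sum_{t=1}^{m} \pi_t^q b_t^2 \varphi''(c_t) - \left(\sum_{t=1}^{m} \pi_t b_t\right)^2 \varphi''\left(\sum_{t=1}^{m} \pi_t c_t\right). \tag{105}$$

By choosing $x = 0$, $a_t = a \in [0, 1]$, for $t = 1, \ldots, m$, and $b_1, \ldots, b_m$ satisfying $\sum_t \pi_t b_t = 0$ in (105), we get

$$f''(0) = \varphi''(a) \sum_{t=1}^{m} \pi_t^q b_t^2,$$

hence, if $f$ is convex, $\varphi''(a) \ge 0$, thus $\varphi$ is convex.

Next, we show that convexity of $f$ also implies $(2 - q)$-convexity of $-1/\varphi''$. By choosing $x = 0$ (thus $c_t = a_t$) and $b_t = \pi_t^{1-q} (\varphi''(a_t))^{-1}$, we get

$$f''(0) = \sum_{t=1}^{m} \frac{\pi_t^{2-q}}{\varphi''(a_t)} - \left(\sum_{t=1}^{m} \frac{\pi_t^{2-q}}{\varphi''(a_t)}\right)^2 \varphi''\left(\sum_{t=1}^{m} \pi_t a_t\right)$$

$$= \left(\sum_{t=1}^{m} \frac{\pi_t^{2-q}}{\varphi''(a_t)}\right) \varphi''\left(\sum_{t=1}^{m} \pi_t a_t\right) \left[ \frac{1}{\varphi''\left(\sum_{t=1}^{m} \pi_t a_t\right)} - \sum_{t=1}^{m} \frac{\pi_t^{2-q}}{\varphi''(a_t)} \right],$$

where the expression inside the square brackets is the Jensen $(2 - q)$-difference of $1/\varphi''$ (see Definition 8). Since $\varphi''(x) \ge 0$, the factor outside the square brackets is non-negative, thus the Jensen $(2 - q)$-difference of $1/\varphi''$ is also nonnegative and $-1/\varphi''$ is $(2 - q)$-convex.

Finally, we show that if $\varphi$ is convex and $-1/\varphi''$ is $(2 - q)$-convex, then $f'' \ge 0$, thus $T_{q,\varphi}^{\pi}$ is convex. Let $r_t = \left(\pi_t^{2-q} / \varphi''(c_t)\right)^{1/2}$ and $s_t = b_t \left(\pi_t^q \varphi''(c_t)\right)^{1/2}$; then, non-negativity of $f''$ results from the following chain of inequalities/equalities:

$$0 \le \left(\sum_{t=1}^{m} r_t^2\right) \left(\sum_{t=1}^{m} s_t^2\right) - \left(\sum_{t=1}^{m} r_t s_t\right)^2 \tag{106}$$

$$= \left(\sum_{t=1}^{m} \frac{\pi_t^{2-q}}{\varphi''(c_t)}\right) \left(\sum_{t=1}^{m} b_t^2 \pi_t^q \varphi''(c_t)\right) - \left(\sum_{t=1}^{m} b_t \pi_t\right)^2 \tag{107}$$

$$\le \frac{1}{\varphi''\left(\sum_{t=1}^{m} \pi_t c_t\right)} \sum_{t=1}^{m} b_t^2 \pi_t^q \varphi''(c_t) - \left(\sum_{t=1}^{m} b_t \pi_t\right)^2 \tag{108}$$

$$= \frac{1}{\varphi''\left(\sum_{t=1}^{m} \pi_t c_t\right)} f''(x), \tag{109}$$

where: (106) is the Cauchy-Schwarz inequality; equality (107) results from the definitions of $r_t$ and $s_t$ and from the fact that $r_t s_t = b_t \pi_t$; inequality (108) states the $(2 - q)$-convexity of $-1/\varphi''$; equality (109) results from (105). ■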


B  Proof  of  Proposition  13 

Proof: The proof of (55), for $q \ge 0$, results from

$$T_q^{\pi}(p_1, \ldots, p_m) = S_q\left(\sum_{t=1}^{m} \pi_t p_t\right) - \sum_{t=1}^{m} \pi_t^q S_q(p_t)$$

$$= \frac{1}{q - 1} \left[ 1 - \sum_{j} \left(\sum_{t=1}^{m} \pi_t p_{tj}\right)^q - \sum_{t=1}^{m} \pi_t^q \left(1 - \sum_{j} p_{tj}^q\right) \right]$$

$$= S_q(\pi) + \frac{1}{q - 1} \sum_{j} \left[ \sum_{t=1}^{m} (\pi_t p_{tj})^q - \left(\sum_{t=1}^{m} \pi_t p_{tj}\right)^q \right]$$

$$\le S_q(\pi),$$

where the inequality holds since, for $y_i \ge 0$: if $q \ge 1$, then $\sum_i y_i^q \le \left(\sum_i y_i\right)^q$; if $q \in [0, 1]$, then $\sum_i y_i^q \ge \left(\sum_i y_i\right)^q$.

The proof that $T_q^{\pi} \ge 0$ for $q \ge 1$ uses the notion of q-convexity. Since $X$ is countable, the Tsallis entropy is as in (4), thus $S_q \ge 0$. Since $-S_q$ is 1-convex, then, by Proposition 7, it is also q-convex for $q \ge 1$. Consequently, from the q-Jensen inequality (Proposition 6), for finite $T$, with $|T| = m$,

$$T_q^{\pi}(p_1, \ldots, p_m) = S_q\left(\sum_{t=1}^{m} \pi_t p_t\right) - \sum_{t=1}^{m} \pi_t^q S_q(p_t) \ge 0.$$

Since $S_q$ is continuous, so is $T_q^{\pi}$, thus the inequality is valid in the limit as $m \to \infty$, which proves the assertion for $T$ countable. Finally, $T_q^{\pi}(\delta_1, \ldots, \delta_1, \ldots) = 0$, where $\delta_1$ is some degenerate distribution.

Finally, to prove (57), for $q \in [0, 1]$ and $X$ finite,

$$T_q^{\pi}(p_1, \ldots, p_m) = S_q\left(\sum_{t=1}^{m} \pi_t p_t\right) - \sum_{t=1}^{m} \pi_t^q S_q(p_t)$$

$$\ge \sum_{t=1}^{m} \pi_t S_q(p_t) - \sum_{t=1}^{m} \pi_t^q S_q(p_t) \tag{111}$$

$$= \sum_{t=1}^{m} (\pi_t - \pi_t^q) S_q(p_t)$$

$$\ge S_q(U) \sum_{t=1}^{m} (\pi_t - \pi_t^q) \tag{112}$$

$$= S_q(\pi) \left[1 - n^{1-q}\right], \tag{113}$$

where the inequality (111) results from $S_q$ being concave, and the inequality (112) holds since $\pi_t - \pi_t^q \le 0$, for $q \in [0, 1]$, and the uniform distribution $U$ maximizes $S_q$ (Proposition 10), with $S_q(U) = \ln_q n$, where $n = |X|$. ■

C Proof of Proposition 44


Proof: We show a counterexample with $q = 1$ (the extensive case), $\pi = (1/2, 1/2)$ and $k = 1$, that rules out both cases. It suffices to show that the square root of the first order conditional Jensen-Shannon divergence violates the triangle inequality for some choice of stochastic processes $s_1, s_2, s_3$, and that this divergence is therefore not a squared distance; this in turn implies that it is not nd and, from Proposition 21, that the above two kernels are not pd. We define $s_1, s_2, s_3$ to be stationary first order Markov processes over a binary alphabet $\mathcal{A} = \{0, 1\}$ defined by the following transition matrices, respectively:


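\begin{equation}
S_1 = \lim_{\epsilon \to 0} \begin{bmatrix} 1-\epsilon & \epsilon \\ 1/4 & 3/4 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 1/4 & 3/4 \end{bmatrix}, \tag{114}
\end{equation}

\begin{equation}
S_2 = \lim_{\epsilon \to 0} \begin{bmatrix} 3/4 & 1/4 \\ \epsilon & 1-\epsilon \end{bmatrix} = \begin{bmatrix} 3/4 & 1/4 \\ 0 & 1 \end{bmatrix}, \tag{115}
\end{equation}

and

\begin{equation}
S_3 = \lim_{\epsilon \to 0} \begin{bmatrix} \epsilon & 1-\epsilon \\ 1/4 & 3/4 \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 1/4 & 3/4 \end{bmatrix}, \tag{116}
\end{equation}

whose stationary distributions are

\begin{equation}
\sigma_1 = \lim_{\epsilon \to 0} \frac{1}{1+4\epsilon} \begin{bmatrix} 1 \\ 4\epsilon \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \tag{117}
\end{equation}

\begin{equation}
\sigma_2 = \lim_{\epsilon \to 0} \frac{1}{1+4\epsilon} \begin{bmatrix} 4\epsilon \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \tag{118}
\end{equation}

and

\begin{equation}
\sigma_3 = \lim_{\epsilon \to 0} \frac{1}{5-4\epsilon} \begin{bmatrix} 1 \\ 4-4\epsilon \end{bmatrix} = \begin{bmatrix} 1/5 \\ 4/5 \end{bmatrix}. \tag{119}
\end{equation}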
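The stationary distributions (117)-(119) can be reproduced with exact rational arithmetic. The sketch below assumes the $\epsilon$-perturbed transition matrices are $[[1-\epsilon, \epsilon], [1/4, 3/4]]$, $[[3/4, 1/4], [\epsilon, 1-\epsilon]]$ and $[[\epsilon, 1-\epsilon], [1/4, 3/4]]$, which is the reading of (114)-(116) that agrees with the closed forms; the function names `stationary_2state` and `chains` are illustrative.

```python
from fractions import Fraction

def stationary_2state(P):
    """Stationary distribution of a 2-state chain: (b, a) / (a + b),
    where a = Pr(0 -> 1) and b = Pr(1 -> 0)."""
    a, b = P[0][1], P[1][0]
    return (b / (a + b), a / (a + b))

def chains(eps):
    """Assumed epsilon-perturbed transition matrices for s_1, s_2, s_3."""
    quarter = Fraction(1, 4)
    S1 = [[1 - eps, eps], [quarter, 1 - quarter]]
    S2 = [[1 - quarter, quarter], [eps, 1 - eps]]
    S3 = [[eps, 1 - eps], [quarter, 1 - quarter]]
    return S1, S2, S3

for eps in (Fraction(1, 10), Fraction(1, 1000), Fraction(1, 10**6)):
    sigma1, sigma2, sigma3 = (stationary_2state(S) for S in chains(eps))
    # Closed forms before taking the limit eps -> 0.
    assert sigma1 == (1 / (1 + 4 * eps), 4 * eps / (1 + 4 * eps))
    assert sigma2 == (4 * eps / (1 + 4 * eps), 1 / (1 + 4 * eps))
    assert sigma3 == (1 / (5 - 4 * eps), (4 - 4 * eps) / (5 - 4 * eps))
    print(float(eps), [tuple(round(float(x), 6) for x in s) for s in (sigma1, sigma2, sigma3)])
```

As `eps` shrinks, the printed vectors approach $(1, 0)$, $(0, 1)$ and $(1/5, 4/5)$. The perturbation is what makes the first two chains irreducible, so that their stationary distributions are unique before the limit is taken.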


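The matrix of first order conditional JT 1-differences (or first order conditional Jensen-Shannon divergences) between the pairs $(s_i, s_j)$, with rows and columns indexed by $s_1, s_2, s_3$ and entries marked $*$ given by symmetry, is

\begin{equation}
\begin{bmatrix} 0 & 0 & 0.390 \\ * & 0 & 0.128 \\ * & * & 0 \end{bmatrix}, \tag{120}
\end{equation}

which fails to be negative definite, since

\begin{equation}
\sqrt{0} + \sqrt{0.128} < \sqrt{0.390}, \tag{121}
\end{equation}

that is, the square roots of the $(s_1, s_2)$ and $(s_2, s_3)$ entries sum to less than the square root of the $(s_1, s_3)$ entry, which violates the triangle inequality required for the square root of the divergence to be a metric.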
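The entries of (120) and the violation (121) can be recomputed from the limiting chains. This is a sketch under stated assumptions: the limiting transition matrices and stationary distributions are taken as $[[1, 0], [1/4, 3/4]]$ with $(1, 0)$, $[[3/4, 1/4], [0, 1]]$ with $(0, 1)$, and $[[0, 1], [1/4, 3/4]]$ with $(1/5, 4/5)$; entropies are in bits (base-2 logarithms, which reproduce the reported values); and the first order conditional divergence is taken to be the conditional entropy $H(X_2 \mid X_1)$ of the equal-weight mixture of the pair distributions minus the average of the individual conditional entropies. All function names are illustrative.

```python
import math

def entropy_bits(p):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def pair_distribution(sigma, P):
    """Distribution of (X_1, X_2) for a stationary chain, flattened as [00, 01, 10, 11]."""
    return [sigma[i] * P[i][j] for i in range(2) for j in range(2)]

def conditional_entropy(joint):
    """H(X_2 | X_1) = H(X_1, X_2) - H(X_1) for a flattened 2x2 joint distribution."""
    marginal = [joint[0] + joint[1], joint[2] + joint[3]]
    return entropy_bits(joint) - entropy_bits(marginal)

def conditional_js(j1, j2):
    """Assumed first order conditional Jensen-Shannon divergence with weights (1/2, 1/2)."""
    mixture = [0.5 * (a + b) for a, b in zip(j1, j2)]
    return conditional_entropy(mixture) - 0.5 * (conditional_entropy(j1) + conditional_entropy(j2))

# Limiting transition matrices and stationary distributions (assumed as stated above).
S = {1: [[1.0, 0.0], [0.25, 0.75]], 2: [[0.75, 0.25], [0.0, 1.0]], 3: [[0.0, 1.0], [0.25, 0.75]]}
sigma = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [0.2, 0.8]}
joint = {t: pair_distribution(sigma[t], S[t]) for t in S}

d12 = conditional_js(joint[1], joint[2])
d13 = conditional_js(joint[1], joint[3])
d23 = conditional_js(joint[2], joint[3])
print("entries:", round(d12, 3), round(d13, 3), round(d23, 3))

# Triangle inequality on the square roots: a metric would need lhs >= rhs.
lhs = math.sqrt(max(d12, 0.0)) + math.sqrt(max(d23, 0.0))
rhs = math.sqrt(max(d13, 0.0))
print("triangle inequality violated:", lhs < rhs)
```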

Interestingly, the 0-th order conditional Jensen-Shannon divergence matrix (this one ensured to be negative definite because it equals a standard Jensen-Shannon divergence matrix) is

\begin{equation}
\begin{bmatrix} 0 & 1 & 0.610 \\ * & 0 & 0.108 \\ * & * & 0 \end{bmatrix}. \tag{122}
\end{equation}


From  the  chain  rule  (68),  we  have  that  the  sum  of  the  matrices  (120)  and  (122)  is  the  second  order 
joint  Jensen-Shannon  divergence,  and  therefore  is  also  guaranteed  to  be  negative  definite.  ■ 
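The chain-rule relation between the two matrices can be checked numerically as well. The sketch below assumes that the 0-th order divergence is the Jensen-Shannon divergence between stationary distributions, that the joint divergence is the Jensen-Shannon divergence between the distributions of consecutive symbol pairs $(X_1, X_2)$, that the weights are $(1/2, 1/2)$ and that entropies are in bits; the limiting chains are the same assumed ones as in the counterexample, and the names are illustrative.

```python
import math

def entropy_bits(p):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def js_bits(p1, p2):
    """Jensen-Shannon divergence with weights (1/2, 1/2), in bits."""
    mixture = [0.5 * (a + b) for a, b in zip(p1, p2)]
    return entropy_bits(mixture) - 0.5 * (entropy_bits(p1) + entropy_bits(p2))

# Limiting transition matrices and stationary distributions (assumed).
S = {1: [[1.0, 0.0], [0.25, 0.75]], 2: [[0.75, 0.25], [0.0, 1.0]], 3: [[0.0, 1.0], [0.25, 0.75]]}
sigma = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [0.2, 0.8]}
# Distribution of (X_1, X_2) for each chain, flattened as [00, 01, 10, 11].
pairs_dist = {t: [sigma[t][i] * S[t][i][j] for i in range(2) for j in range(2)] for t in S}

index_pairs = [(1, 2), (1, 3), (2, 3)]
zeroth = {ij: js_bits(sigma[ij[0]], sigma[ij[1]]) for ij in index_pairs}
joint = {ij: js_bits(pairs_dist[ij[0]], pairs_dist[ij[1]]) for ij in index_pairs}
# Chain rule: joint = 0-th order + first order conditional, so the difference
# should reproduce the conditional entries.
conditional = {ij: joint[ij] - zeroth[ij] for ij in index_pairs}

for ij in index_pairs:
    print(ij, "0-th:", round(zeroth[ij], 3), "joint:", round(joint[ij], 3),
          "conditional:", round(conditional[ij], 3))

# Unlike the conditional entries, the joint entries satisfy the triangle inequality.
joint_triangle_ok = math.sqrt(joint[(1, 2)]) + math.sqrt(joint[(2, 3)]) >= math.sqrt(joint[(1, 3)])
print("joint entries satisfy the triangle inequality:", joint_triangle_ok)
```

Only the sum and the 0-th order term behave like squared distances here; the conditional term, obtained as their difference, does not.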

[Plot not recoverable from extraction. Vertical axis: average error rate (%); panel titles: p=3, p=4.]


Figure  4:  Results  for  earn-vs-acq  using  string  kernels  and  p  =  3, 4,  5.  The  error  bars  represent  ±1 
standard  deviation  on  15  runs.  Training  (resp.  testing)  with  200  (resp.  250)  samples  per  class. 

Figure 5: Results for stud-vs-fac using string kernels.